Understanding HRV: What Drives Heart Rate Variability?

Author

Camilo Martinez

Published

October 10, 2025

Project Overview

Research Question

What aspects of my daily routine lead to high or low HRV (Heart Rate Variability)?

This project uses inference rather than pure prediction to understand the underlying mechanisms that drive HRV variability. Our goal is to identify which lifestyle factors, sleep metrics, and workout characteristics have the strongest causal relationships with next-day HRV.

Data Sources

All data comes from the WHOOP API, aggregated into a daily feature table with one row per day. Each row represents:

  • Target Variable: hrv_rmssd_ms (HRV in milliseconds)
  • Recovery Metrics: Resting heart rate, SpO2, skin temperature
  • Sleep Metrics: Sleep stages, efficiency, disturbances, respiratory rate
  • Workout Metrics: Total workout time, heart rate zones, average workout intensity

Methodology

This analysis follows a rigorous data science workflow:

  1. Phase 1: Deep Exploratory Data Analysis (EDA)
    • Visualize target variable over time
    • Analyze distributions and detect outliers
    • Perform bivariate analysis
  2. Phase 2: Feature Engineering & Selection
  3. Phase 3: Statistical Modeling & Inference
  4. Phase 4: Validation & Interpretation

Executive Summary: Key Findings

This analysis successfully built a machine learning model capable of explaining a significant portion of daily HRV variance. The key takeaway is that HRV is less about single-day actions and more about the body’s underlying physiological state and accumulated load.

The three most powerful drivers identified are:

  1. Resting Heart Rate (RHR): A low RHR is the single strongest predictor of a high HRV, indicating a state of recovery and readiness.
  2. Acute Stress (RHR Deviation): A sudden spike in RHR compared to its 7-day average is a powerful negative driver, signaling acute stress, illness, or poor recovery.
  3. Chronic Fitness Base: Long-term cumulative workout minutes positively correlate with a higher baseline HRV, demonstrating that a consistent training load builds a more resilient system.

These insights form the basis for a personal, data-driven protocol to optimize daily performance and recovery.


Setup & Data Loading

Import libraries and configure environment
# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Database connection
import psycopg2
from sqlalchemy import create_engine
import os
from dotenv import load_dotenv

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Statistical analysis
from scipy import stats
from scipy.stats import pearsonr, spearmanr

# Configure plotting style
plt.style.use('dark_background')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (8, 4.5)
plt.rcParams['font.size'] = 9

# Load environment variables
load_dotenv()

print("✅ Libraries loaded successfully")
✅ Libraries loaded successfully

We proceed to load the data from a PostgreSQL database. The SQL query aggregates daily features from multiple tables, ensuring one row per day with the relevant metrics.

Load HRV feature data from PostgreSQL
# Database connection
DATABASE_URL = os.getenv('DATABASE_URL')

# Fix for SQLAlchemy: Replace 'postgres://' with 'postgresql://'
if DATABASE_URL and DATABASE_URL.startswith('postgres://'):
    DATABASE_URL = DATABASE_URL.replace('postgres://', 'postgresql://', 1)

# Create SQLAlchemy engine
engine = create_engine(DATABASE_URL)

# SQL query to fetch daily aggregated features
query = """
-- CTE 1: Pre-aggregates daily workout data
WITH workouts_daily_summary AS (
    SELECT
        DATE(start_time) AS workout_date,
        SUM(EXTRACT(EPOCH FROM (end_time - start_time)) / 60) AS total_workout_minutes,
        AVG(avg_heart_rate_bpm) AS avg_workout_heart_rate_bpm,
        SUM(hr_zone_0_ms) AS total_hr_zone_0_ms,
        SUM(hr_zone_1_ms) AS total_hr_zone_1_ms,
        SUM(hr_zone_2_ms) AS total_hr_zone_2_ms,
        SUM(hr_zone_3_ms) AS total_hr_zone_3_ms,
        SUM(hr_zone_4_ms) AS total_hr_zone_4_ms,
        SUM(hr_zone_5_ms) AS total_hr_zone_5_ms
    FROM
        whoop_workouts
    WHERE
        sport_name NOT IN ('sauna', 'meditation')
    GROUP BY
        1
),

-- CTE 2: Aggregates ALL cycles for a given day (sleep + naps)
cycles_daily_aggregates AS (
    SELECT
        DATE(start_time) AS cycle_date,
        -- Duration-weighted average heart rate for the entire day
        SUM(avg_heart_rate_bpm * EXTRACT(EPOCH FROM (end_time - start_time))) / 
            SUM(EXTRACT(EPOCH FROM (end_time - start_time))) AS cycle_avg_hr_bpm,
        -- Maximum heart rate across all cycles for that day
        MAX(max_heart_rate_bpm) AS cycle_max_hr_bpm
    FROM
        whoop_cycles
    GROUP BY
        1
),

-- CTE 3: Identifies the primary (longest) sleep cycle for each day
main_sleep_cycle AS (
    SELECT
        id,
        start_time,
        end_time,
        -- Rank cycles by duration - longest sleep is rank 1
        ROW_NUMBER() OVER(
            PARTITION BY DATE(start_time) 
            ORDER BY (end_time - start_time) DESC
        ) as sleep_rank
    FROM
        whoop_cycles
)

-- Final query: One row per day with all features
SELECT
    DATE(main.start_time) AS cycle_date,

    -- TARGET VARIABLE
    r.hrv_rmssd_ms,

    -- Recovery Features
    r.resting_heart_rate_bpm,
    r.spo2_percentage,
    r.skin_temp_celsius,

    -- Cycle Features (accounts for naps)
    c_agg.cycle_avg_hr_bpm,
    c_agg.cycle_max_hr_bpm,

    -- Sleep Features
    s.respiratory_rate,
    s.sleep_consistency_percentage,
    s.sleep_efficiency_percentage,
    s.total_in_bed_time_ms,
    s.total_awake_time_ms,
    s.total_light_sleep_time_ms,
    s.total_slow_wave_sleep_time_ms,
    s.total_rem_sleep_time_ms,
    s.disturbance_count,

    -- Workout Features (aggregated daily)
    COALESCE(wds.total_workout_minutes, 0) AS total_workout_minutes,
    COALESCE(wds.avg_workout_heart_rate_bpm, 0) AS avg_workout_hr_bpm,
    COALESCE(wds.total_hr_zone_0_ms, 0) AS total_hr_zone_0_ms,
    COALESCE(wds.total_hr_zone_1_ms, 0) AS total_hr_zone_1_ms,
    COALESCE(wds.total_hr_zone_2_ms, 0) AS total_hr_zone_2_ms,
    COALESCE(wds.total_hr_zone_3_ms, 0) AS total_hr_zone_3_ms,
    COALESCE(wds.total_hr_zone_4_ms, 0) AS total_hr_zone_4_ms,
    COALESCE(wds.total_hr_zone_5_ms, 0) AS total_hr_zone_5_ms

FROM
    main_sleep_cycle AS main
LEFT JOIN
    cycles_daily_aggregates AS c_agg ON DATE(main.start_time) = c_agg.cycle_date
LEFT JOIN
    whoop_recovery AS r ON main.id = r.cycle_id
LEFT JOIN
    whoop_sleep AS s ON main.id = s.cycle_id
LEFT JOIN
    workouts_daily_summary AS wds ON DATE(main.start_time) = wds.workout_date
WHERE
    main.sleep_rank = 1  -- One row per day
ORDER BY
    cycle_date DESC;
"""

# Load data
df = pd.read_sql_query(query, engine)
# Ensure datetimelike dtype for downstream .dt usage
df['cycle_date'] = pd.to_datetime(df['cycle_date'], errors='coerce')

print(f"✅ Data loaded successfully")
print(f"📊 Dataset shape: {df.shape[0]} days × {df.shape[1]} features")
print(f"📅 Date range: {df['cycle_date'].min()} to {df['cycle_date'].max()}")
✅ Data loaded successfully
📊 Dataset shape: 260 days × 24 features
📅 Date range: 2025-01-14 00:00:00 to 2025-10-08 00:00:00
Display dataset structure and summary statistics
# Display first few rows
print("\n" + "="*80)
print("DATASET PREVIEW")
print("="*80)
display(df.head(10))

# Display data types
print("\n" + "="*80)
print("DATA TYPES")
print("="*80)
print(df.dtypes)

# Basic statistics
print("\n" + "="*80)
print("SUMMARY STATISTICS")
print("="*80)
display(df.describe())

# Missing values
print("\n" + "="*80)
print("MISSING VALUES")
print("="*80)
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
}).sort_values('Missing Count', ascending=False)
print(missing_df[missing_df['Missing Count'] > 0])

================================================================================
DATASET PREVIEW
================================================================================
cycle_date hrv_rmssd_ms resting_heart_rate_bpm spo2_percentage skin_temp_celsius cycle_avg_hr_bpm cycle_max_hr_bpm respiratory_rate sleep_consistency_percentage sleep_efficiency_percentage ... total_rem_sleep_time_ms disturbance_count total_workout_minutes avg_workout_hr_bpm total_hr_zone_0_ms total_hr_zone_1_ms total_hr_zone_2_ms total_hr_zone_3_ms total_hr_zone_4_ms total_hr_zone_5_ms
0 2025-10-08 78.3485 43.0 99.13 33.87 57.0 161.0 14.56 65.0 95.69 ... 5646140 11 44.964733 107.000000 807990.0 1122010.0 535980.0 225000.0 7000.0 0.0
1 2025-10-07 69.8866 45.0 97.62 34.71 54.0 140.0 14.59 71.0 92.79 ... 6528290 9 62.460283 93.666667 1713980.0 1974980.0 57000.0 0.0 0.0 0.0
2 2025-10-06 59.3735 49.0 97.18 33.51 60.0 175.0 14.65 59.0 93.50 ... 6994190 11 45.515850 124.000000 150000.0 1364990.0 548990.0 443000.0 219000.0 6000.0
3 2025-10-05 64.3845 44.0 97.83 34.14 74.0 185.0 14.60 45.0 93.86 ... 6066210 11 75.781383 149.000000 278990.0 297000.0 419000.0 1244990.0 1822980.0 483990.0
4 2025-10-04 65.3048 48.0 97.71 34.04 60.0 135.0 15.41 41.0 94.69 ... 10119310 16 10.477950 110.000000 24000.0 602990.0 2000.0 0.0 0.0 0.0
5 2025-10-03 73.1262 43.0 97.33 33.47 65.0 141.0 15.35 73.0 93.30 ... 5974260 10 40.596733 92.000000 1412990.0 876990.0 146000.0 0.0 0.0 0.0
6 2025-10-02 100.8779 42.0 97.75 34.38 50.0 118.0 14.68 60.0 95.00 ... 3729120 8 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
7 2025-10-01 68.3601 45.0 97.18 33.67 55.0 122.0 15.06 69.0 93.15 ... 6367170 6 7.161350 87.000000 339990.0 90000.0 0.0 0.0 0.0 0.0
8 2025-09-30 55.3180 49.0 98.00 33.53 63.0 142.0 16.70 67.0 80.08 ... 5286060 10 22.557033 106.000000 453000.0 636990.0 263000.0 0.0 0.0 0.0
9 2025-09-29 46.3208 58.0 98.00 34.38 63.0 127.0 16.88 53.0 90.34 ... 12280390 24 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0

10 rows × 24 columns


================================================================================
DATA TYPES
================================================================================
cycle_date                       datetime64[ns]
hrv_rmssd_ms                            float64
resting_heart_rate_bpm                  float64
spo2_percentage                         float64
skin_temp_celsius                       float64
cycle_avg_hr_bpm                        float64
cycle_max_hr_bpm                        float64
respiratory_rate                        float64
sleep_consistency_percentage            float64
sleep_efficiency_percentage             float64
total_in_bed_time_ms                      int64
total_awake_time_ms                       int64
total_light_sleep_time_ms                 int64
total_slow_wave_sleep_time_ms             int64
total_rem_sleep_time_ms                   int64
disturbance_count                         int64
total_workout_minutes                   float64
avg_workout_hr_bpm                      float64
total_hr_zone_0_ms                      float64
total_hr_zone_1_ms                      float64
total_hr_zone_2_ms                      float64
total_hr_zone_3_ms                      float64
total_hr_zone_4_ms                      float64
total_hr_zone_5_ms                      float64
dtype: object

================================================================================
SUMMARY STATISTICS
================================================================================
cycle_date hrv_rmssd_ms resting_heart_rate_bpm spo2_percentage skin_temp_celsius cycle_avg_hr_bpm cycle_max_hr_bpm respiratory_rate sleep_consistency_percentage sleep_efficiency_percentage ... total_rem_sleep_time_ms disturbance_count total_workout_minutes avg_workout_hr_bpm total_hr_zone_0_ms total_hr_zone_1_ms total_hr_zone_2_ms total_hr_zone_3_ms total_hr_zone_4_ms total_hr_zone_5_ms
count 260 260.000000 260.000000 260.000000 255.000000 260.000000 260.000000 260.000000 260.000000 260.000000 ... 2.600000e+02 260.000000 260.000000 260.000000 2.600000e+02 2.600000e+02 2.600000e+02 2.600000e+02 2.600000e+02 260.000000
mean 2025-05-27 13:50:46.153846272 66.697452 46.361538 96.775038 33.892392 60.719231 150.073077 14.556346 69.130769 92.733192 ... 5.826298e+06 10.484615 34.298576 64.340449 9.273339e+05 6.622991e+05 1.629445e+05 1.407983e+05 1.141891e+05 16122.857692
min 2025-01-14 00:00:00 37.835900 40.000000 91.730000 30.200000 48.000000 97.000000 13.360000 0.000000 74.130000 ... 0.000000e+00 2.000000 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
25% 2025-03-21 18:00:00 59.609950 44.000000 96.207500 33.620000 57.000000 137.000000 14.180000 63.750000 90.477500 ... 4.862942e+06 8.000000 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
50% 2025-05-27 12:00:00 65.931450 46.000000 96.825000 33.940000 60.000000 150.000000 14.410000 71.000000 93.290000 ... 5.838530e+06 10.000000 26.734258 87.750000 2.152050e+05 3.465000e+05 4.500000e+03 0.000000e+00 0.000000e+00 0.000000
75% 2025-08-02 06:00:00 72.779225 48.000000 97.532500 34.200000 64.000000 162.000000 14.770000 79.000000 95.402500 ... 6.812656e+06 13.000000 50.072104 106.125000 1.232423e+06 1.153954e+06 2.254830e+05 1.122342e+05 0.000000e+00 0.000000
max 2025-10-08 00:00:00 100.877900 63.000000 99.190000 35.090000 79.000000 191.000000 17.870000 91.000000 98.880000 ... 1.228039e+07 25.000000 368.285117 157.000000 1.828716e+07 3.762960e+06 2.034970e+06 4.417960e+06 4.689960e+06 847870.000000
std NaN 10.799779 3.596446 1.090775 0.493823 4.855639 17.594490 0.619297 14.251133 3.801709 ... 1.857453e+06 3.808616 44.815617 53.094786 1.757808e+06 8.032215e+05 2.880310e+05 4.036516e+05 4.727037e+05 85662.113386

8 rows × 24 columns


================================================================================
MISSING VALUES
================================================================================
                   Missing Count  Percentage
skin_temp_celsius              5        1.92

Phase 1: Deep Exploratory Data Analysis

1.1 Target Variable Analysis: HRV Over Time

Goal: Understand the baseline behavior of HRV. Look for trends, seasonality, and anomalies.

Visualize HRV trends over time
# Create interactive time series plot
fig = go.Figure()

# Add main HRV line
fig.add_trace(go.Scatter(
    x=df['cycle_date'],
    y=df['hrv_rmssd_ms'],
    mode='lines+markers',
    name='HRV (RMSSD)',
    line=dict(color='#00D9FF', width=2),
    marker=dict(size=4),
    hovertemplate='<b>Date</b>: %{x}<br>' +
                  '<b>HRV</b>: %{y:.1f} ms<br>' +
                  '<extra></extra>'
))

# Add 7-day moving average
df_sorted = df.sort_values('cycle_date')
df_sorted['hrv_ma7'] = df_sorted['hrv_rmssd_ms'].rolling(window=7, center=True).mean()

fig.add_trace(go.Scatter(
    x=df_sorted['cycle_date'],
    y=df_sorted['hrv_ma7'],
    mode='lines',
    name='7-Day Moving Average',
    line=dict(color='#FF6B6B', width=3, dash='dash'),
    hovertemplate='<b>7-Day Avg</b>: %{y:.1f} ms<br>' +
                  '<extra></extra>'
))

# Add median reference line
median_hrv = df['hrv_rmssd_ms'].median()
fig.add_hline(
    y=median_hrv, 
    line_dash="dot", 
    line_color="yellow",
    annotation_text=f"Median: {median_hrv:.1f} ms",
    annotation_position="right"
)

fig.update_layout(
    title="HRV (RMSSD) Over Time: Daily Variation & Trends",
    xaxis_title="Date",
    yaxis_title="HRV (ms)",
    template="plotly_dark",
    hovermode='x unified',
    height=500
)

fig.show()


# Print key statistics
print(f"\n📈 HRV STATISTICS")
print(f"{'='*50}")
print(f"Mean HRV:     {df['hrv_rmssd_ms'].mean():.2f} ms")
print(f"Median HRV:   {df['hrv_rmssd_ms'].median():.2f} ms")
print(f"Std Dev:      {df['hrv_rmssd_ms'].std():.2f} ms")
print(f"Min HRV:      {df['hrv_rmssd_ms'].min():.2f} ms (Date: {df.loc[df['hrv_rmssd_ms'].idxmin(), 'cycle_date']})")
print(f"Max HRV:      {df['hrv_rmssd_ms'].max():.2f} ms (Date: {df.loc[df['hrv_rmssd_ms'].idxmax(), 'cycle_date']})")
print(f"Range:        {df['hrv_rmssd_ms'].max() - df['hrv_rmssd_ms'].min():.2f} ms")
print(f"CV:           {(df['hrv_rmssd_ms'].std() / df['hrv_rmssd_ms'].mean() * 100):.2f}%")

HRV (RMSSD) time series showing daily variation and trends


📈 HRV STATISTICS
==================================================
Mean HRV:     66.70 ms
Median HRV:   65.93 ms
Std Dev:      10.80 ms
Min HRV:      37.84 ms (Date: 2025-06-21 00:00:00)
Max HRV:      100.88 ms (Date: 2025-10-02 00:00:00)
Range:        63.04 ms
CV:           16.19%

My HRV data shows steady resilience with improving recovery trends:

  • I'm maintaining a strong, stable baseline around 66 ms, showing consistent recovery.
  • The gradual rise from August to October signals improving fitness or recovery habits (better sleep, balanced training).
  • The June dip likely reflects fatigue or stress, moments where recovery lagged behind strain.
  • The October peak suggests optimal recovery and readiness.

The coefficient of variation (standard deviation / mean) is 16.19%, a healthy level of variability that indicates good autonomic nervous system function.
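To put numbers on those baseline shifts rather than reading them off the chart, a short rolling-statistics sketch like the one below can track the monthly HRV baseline and a rolling coefficient of variation. It is only a sketch: it reuses df, cycle_date, and hrv_rmssd_ms from above, while the 30-day window and the variable names (df_roll, rolling_cv) are arbitrary choices made here.

Sketch: rolling HRV baseline and coefficient of variation
# Work on a date-indexed, date-sorted copy so time-based rolling windows are possible
df_roll = df.sort_values('cycle_date').set_index('cycle_date')

# 30-day rolling mean and CV of HRV (window length is an arbitrary choice)
rolling_mean = df_roll['hrv_rmssd_ms'].rolling('30D').mean()
rolling_std = df_roll['hrv_rmssd_ms'].rolling('30D').std()
rolling_cv = rolling_std / rolling_mean * 100

# Monthly average HRV makes the June dip and the August-to-October rise easy to confirm
monthly_mean = df_roll['hrv_rmssd_ms'].groupby(df_roll.index.to_period('M')).mean()
print(monthly_mean.round(1))
print(f"Most recent 30-day CV: {rolling_cv.iloc[-1]:.1f}%")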

Analyze HRV distribution
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram with KDE
axes[0].hist(df['hrv_rmssd_ms'], bins=30, alpha=0.7, color='#00D9FF', edgecolor='white')
axes[0].axvline(df['hrv_rmssd_ms'].mean(), color='#FF6B6B', linestyle='--', linewidth=2, label=f'Mean: {df["hrv_rmssd_ms"].mean():.1f}')
axes[0].axvline(df['hrv_rmssd_ms'].median(), color='yellow', linestyle='--', linewidth=2, label=f'Median: {df["hrv_rmssd_ms"].median():.1f}')
axes[0].set_xlabel('HRV (ms)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('HRV Distribution', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Q-Q plot for normality check
stats.probplot(df['hrv_rmssd_ms'].dropna(), dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot: Testing Normality', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Normality test
statistic, p_value = stats.shapiro(df['hrv_rmssd_ms'].dropna())
print(f"\n🔬 SHAPIRO-WILK NORMALITY TEST")
print(f"{'='*50}")
print(f"Test Statistic: {statistic:.4f}")
print(f"P-value:        {p_value:.4f}")
if p_value > 0.05:
    print("✅ HRV appears normally distributed (p > 0.05)")
else:
    print("⚠️  HRV may not be normally distributed (p < 0.05)")

HRV distribution with normality test

🔬 SHAPIRO-WILK NORMALITY TEST
==================================================
Test Statistic: 0.9884
P-value:        0.0348
⚠️  HRV may not be normally distributed (p < 0.05)
•   Not normal: Shapiro-Wilk p = 0.0348 → reject normality at 5%.
•   Shape: Nearly centered (mean 66.7 ≈ median 65.9) but with non-normal tails (Q-Q plot deviates at both ends). Likely mild right-skew/heavy tails.
•   Implication: Use robust stats & tests: median/IQR, Spearman (ρ) over Pearson when relating features, non-parametric tests, or bootstrap CIs.
•   For modeling: Tree models are fine. For linear models, consider transforming HRV or predictors (e.g., Box-Cox/Yeo-Johnson) and check residuals. A short sketch of the Spearman and Yeo-Johnson ideas follows below.
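As a concrete illustration of those two recommendations, here is a minimal sketch, assuming the df loaded above: rank-based (Spearman) correlations with HRV, and a Yeo-Johnson transform of the target via scikit-learn's PowerTransformer. Transforming only the target is an illustrative choice; in practice you would also check predictors and residuals.

Sketch: Spearman correlations and a Yeo-Johnson transform of HRV
from sklearn.preprocessing import PowerTransformer

# Rank-based correlations are robust to the heavy tails seen in the Q-Q plot
spearman_corr = (
    df.select_dtypes(include='number')
      .corr(method='spearman')['hrv_rmssd_ms']
      .drop('hrv_rmssd_ms')
      .sort_values(key=abs, ascending=False)
)
print(spearman_corr.head(5).round(3))

# Yeo-Johnson (PowerTransformer's default) pulls in the tails for linear-model use
pt = PowerTransformer(method='yeo-johnson')
hrv_yj = pt.fit_transform(df[['hrv_rmssd_ms']]).ravel()
print(f"Skew before: {df['hrv_rmssd_ms'].skew():.2f} | after: {pd.Series(hrv_yj).skew():.2f}")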

1.2 Feature Distributions & Outlier Detection

Goal: Identify data quality issues, outliers, and potential errors before modeling.

Analyze distributions of all features
# Select numeric columns (exclude date and target)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove('hrv_rmssd_ms')  # We'll handle target separately

# Create subplots
n_cols = 4
n_rows = int(np.ceil(len(numeric_cols) / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, n_rows * 3))
axes = axes.flatten()

for idx, col in enumerate(numeric_cols):
    ax = axes[idx]
    data = df[col].dropna()
    
    # Histogram
    ax.hist(data, bins=30, alpha=0.7, color='#00D9FF', edgecolor='white')
    ax.axvline(data.mean(), color='#FF6B6B', linestyle='--', linewidth=2, label=f'Mean: {data.mean():.1f}')
    ax.axvline(data.median(), color='yellow', linestyle='--', linewidth=2, label=f'Median: {data.median():.1f}')
    
    ax.set_title(col, fontsize=10, fontweight='bold')
    ax.set_xlabel('Value', fontsize=8)
    ax.set_ylabel('Frequency', fontsize=8)
    ax.legend(fontsize=7)
    ax.grid(alpha=0.3)

# Hide unused subplots
for idx in range(len(numeric_cols), len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

Distribution plots for all numeric features
Detect outliers using IQR method
print("\n🔍 OUTLIER DETECTION (IQR Method)")
print("="*80)

outlier_summary = []

for col in numeric_cols:
    data = df[col].dropna()
    
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    
    if len(outliers) > 0:
        outlier_summary.append({
            'Feature': col,
            'Outlier Count': len(outliers),
            'Percentage': f"{len(outliers) / len(data) * 100:.2f}%",
            'Lower Bound': f"{lower_bound:.2f}",
            'Upper Bound': f"{upper_bound:.2f}",
            'Min Outlier': f"{outliers.min():.2f}",
            'Max Outlier': f"{outliers.max():.2f}"
        })

if outlier_summary:
    outlier_df = pd.DataFrame(outlier_summary)
    display(outlier_df)
else:
    print("✅ No significant outliers detected in any feature")

🔍 OUTLIER DETECTION (IQR Method)
================================================================================
Feature Outlier Count Percentage Lower Bound Upper Bound Min Outlier Max Outlier
0 resting_heart_rate_bpm 8 3.08% 38.00 54.00 56.00 63.00
1 spo2_percentage 5 1.92% 94.22 99.52 91.73 94.04
2 skin_temp_celsius 6 2.35% 32.75 35.07 30.20 35.09
3 cycle_avg_hr_bpm 1 0.38% 46.50 74.50 79.00 79.00
4 cycle_max_hr_bpm 1 0.38% 99.50 199.50 97.00 97.00
5 respiratory_rate 14 5.38% 13.29 15.65 15.66 17.87
6 sleep_consistency_percentage 8 3.08% 40.88 101.88 0.00 40.00
7 sleep_efficiency_percentage 5 1.92% 83.09 102.79 74.13 81.99
8 total_in_bed_time_ms 15 5.77% 14505691.25 37005881.25 4123104.00 44443210.00
9 total_awake_time_ms 8 3.08% -921874.38 4579090.62 4623130.00 9026818.00
10 total_light_sleep_time_ms 9 3.46% 2929821.38 18866192.38 1890835.00 22615710.00
11 total_slow_wave_sleep_time_ms 11 4.23% 2970606.25 10992252.25 115356.00 2826311.00
12 total_rem_sleep_time_ms 13 5.00% 1938372.62 9737225.62 0.00 12280390.00
13 disturbance_count 2 0.77% 0.50 20.50 24.00 25.00
14 total_workout_minutes 8 3.08% -75.11 125.18 126.13 368.29
15 total_hr_zone_0_ms 20 7.69% -1848634.50 3081057.50 3118525.00 18287159.00
16 total_hr_zone_1_ms 6 2.31% -1730931.75 2884886.25 2908990.00 3762960.00
17 total_hr_zone_2_ms 20 7.69% -338224.50 563707.50 586399.00 2034970.00
18 total_hr_zone_3_ms 41 15.77% -168351.38 280585.62 287430.00 4417960.00
19 total_hr_zone_4_ms 64 24.62% 0.00 0.00 961.00 4689960.00
20 total_hr_zone_5_ms 26 10.00% 0.00 0.00 961.00 847870.00
Boxplots for visual outlier inspection
# Normalize data for visualization (z-score)
df_normalized = df[numeric_cols].apply(lambda x: (x - x.mean()) / x.std())

fig, ax = plt.subplots(figsize=(10, 5))
bp = ax.boxplot([df_normalized[col].dropna() for col in numeric_cols],
                 labels=numeric_cols,
                 patch_artist=True,
                 showfliers=True)

# Customize boxplot colors
for patch in bp['boxes']:
    patch.set_facecolor('#00D9FF')
    patch.set_alpha(0.7)

for whisker in bp['whiskers']:
    whisker.set(color='white', linewidth=1.5)

for cap in bp['caps']:
    cap.set(color='white', linewidth=1.5)

for median in bp['medians']:
    median.set(color='#FF6B6B', linewidth=2)

for flier in bp['fliers']:
    flier.set(marker='o', color='red', alpha=0.5)

ax.set_xlabel('Features (Normalized)', fontsize=12, fontweight='bold')
ax.set_ylabel('Z-Score', fontsize=12, fontweight='bold')
ax.set_title('Outlier Detection: Boxplots (Normalized)', fontsize=14, fontweight='bold')
ax.axhline(y=0, color='yellow', linestyle='--', alpha=0.5, label='Mean')
ax.grid(axis='y', alpha=0.3)
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()

Boxplots showing outliers across all features

1.3 Bivariate Analysis: HRV Relationships

Goal: Understand which features have the strongest relationships with HRV.

Correlation matrix with HRV
# Calculate correlations with HRV
correlations = df[numeric_cols + ['hrv_rmssd_ms']].corr()['hrv_rmssd_ms'].sort_values(ascending=False)

# Create horizontal bar plot
fig, ax = plt.subplots(figsize=(8, 8))

colors = ['#FF6B6B' if x < 0 else '#00D9FF' for x in correlations.values[1:]]  # Skip HRV itself
bars = ax.barh(correlations.index[1:], correlations.values[1:], color=colors, alpha=0.8)

ax.axvline(x=0, color='white', linestyle='-', linewidth=1)
ax.set_xlabel('Correlation with HRV', fontsize=12, fontweight='bold')
ax.set_title('Feature Correlations with HRV (RMSSD)', fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

# Add value labels
for i, (bar, val) in enumerate(zip(bars, correlations.values[1:])):
    ax.text(val + 0.01 if val > 0 else val - 0.01, i, f'{val:.3f}',
            ha='left' if val > 0 else 'right', va='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n🔗 TOP 10 STRONGEST CORRELATIONS WITH HRV")
print("="*50)
# Convert to list to make it subscriptable
corr_items = list(correlations.items())
for feature, corr in corr_items[1:11]:  # Top 10, skip HRV itself
    print(f"{feature:40s}: {corr:>7.3f}")

Bar chart of feature correlations with HRV

🔗 TOP 10 STRONGEST CORRELATIONS WITH HRV
==================================================
total_in_bed_time_ms                    :   0.276
total_rem_sleep_time_ms                 :   0.271
total_light_sleep_time_ms               :   0.269
avg_workout_hr_bpm                      :   0.125
sleep_efficiency_percentage             :   0.092
total_slow_wave_sleep_time_ms           :   0.091
spo2_percentage                         :   0.076
sleep_consistency_percentage            :   0.074
disturbance_count                       :   0.072
total_hr_zone_1_ms                      :   0.065
Scatter plots for key features vs HRV
# Select top 6 correlated features (absolute value)
top_features = correlations.abs().sort_values(ascending=False).index[1:7].tolist()

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
axes = axes.flatten()

for idx, feature in enumerate(top_features):
    ax = axes[idx]
    
    # Scatter plot
    ax.scatter(df[feature], df['hrv_rmssd_ms'], alpha=0.6, s=30, color='#00D9FF')
    
    # Add regression line
    mask = df[[feature, 'hrv_rmssd_ms']].notna().all(axis=1)
    if mask.sum() > 1:
        z = np.polyfit(df.loc[mask, feature], df.loc[mask, 'hrv_rmssd_ms'], 1)
        p = np.poly1d(z)
        ax.plot(df.loc[mask, feature].sort_values(), 
                p(df.loc[mask, feature].sort_values()), 
                "r--", linewidth=2, alpha=0.8, label=f'Linear Fit')
    
    # Calculate correlation on paired (non-missing) observations so the two series stay aligned
    corr, p_val = pearsonr(df.loc[mask, feature], df.loc[mask, 'hrv_rmssd_ms'])
    
    ax.set_xlabel(feature, fontsize=10, fontweight='bold')
    ax.set_ylabel('HRV (ms)', fontsize=10, fontweight='bold')
    ax.set_title(f'{feature}\nCorr: {corr:.3f} (p={p_val:.4f})', fontsize=11, fontweight='bold')
    ax.grid(alpha=0.3)
    ax.legend()

plt.tight_layout()
plt.show()

Bivariate relationships: Key features vs HRV
Analyze relationship between sleep stages and HRV
# Convert sleep times from ms to hours
sleep_cols = ['total_light_sleep_time_ms', 'total_slow_wave_sleep_time_ms', 
              'total_rem_sleep_time_ms', 'total_awake_time_ms']

df_sleep = df.copy()
for col in sleep_cols:
    df_sleep[col.replace('_ms', '_hrs')] = df_sleep[col] / (1000 * 60 * 60)

# Create grouped bar plot with improved labels
sleep_stage_corrs = []
# Use cleaner, more readable labels
label_mapping = {
    'total_light_sleep_time_ms': 'Light Sleep',
    'total_slow_wave_sleep_time_ms': 'Deep Sleep (SWS)',
    'total_rem_sleep_time_ms': 'REM Sleep',
    'total_awake_time_ms': 'Awake Time'
}

for col in sleep_cols:
    col_hrs = col.replace('_ms', '_hrs')
    pair = df_sleep[[col_hrs, 'hrv_rmssd_ms']].dropna()  # pairwise drop keeps rows aligned
    corr, _ = pearsonr(pair[col_hrs], pair['hrv_rmssd_ms'])
    sleep_stage_corrs.append({
        'Sleep Stage': label_mapping[col],
        'Correlation': corr
    })

sleep_corr_df = pd.DataFrame(sleep_stage_corrs)

# Create figure with more padding
fig, ax = plt.subplots(figsize=(10, 5.5))
bars = ax.bar(sleep_corr_df['Sleep Stage'], sleep_corr_df['Correlation'], 
              color=['#00D9FF', '#FF6B6B', '#FFD93D', '#6BCB77'], 
              alpha=0.8, width=0.6, edgecolor='white', linewidth=1.5)

ax.axhline(y=0, color='white', linestyle='-', linewidth=1.5, alpha=0.5)
ax.set_ylabel('Correlation with HRV', fontsize=14, fontweight='bold', labelpad=15)
ax.set_xlabel('Sleep Stage', fontsize=14, fontweight='bold', labelpad=15)
ax.set_title('Sleep Stage Correlations with HRV', fontsize=16, fontweight='bold', pad=20)
ax.grid(axis='y', alpha=0.3, linestyle='--')

# Improved value labels - positioned directly on top of bars
for bar, val in zip(bars, sleep_corr_df['Correlation']):
    height = bar.get_height()
    # Position labels just above the top of each bar
    ax.text(bar.get_x() + bar.get_width()/2., 
            height,  # Exactly at the bar height
            f'{val:.3f}', 
            ha='center', 
            va='bottom',  # Align bottom of text with top of bar
            fontsize=13, 
            fontweight='bold',
            color='white',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='black', alpha=0.8, edgecolor='white', linewidth=1))

# Improved x-axis labels - no rotation needed with cleaner names
ax.tick_params(axis='both', which='major', labelsize=12, length=6, width=1.5)
plt.xticks(rotation=0)  # Horizontal labels for better readability

# Add more padding around the plot
plt.tight_layout(pad=2.0)  # Increased padding
plt.subplots_adjust(left=0.12, right=0.95, top=0.92, bottom=0.12)

plt.show()

Sleep stage impact on HRV
•   REM and light sleep matter most for next-morning HRV: r ≈ 0.27 each → about 7% of HRV variance individually (0.27²; a quick check of this arithmetic follows below).
•   Slow-wave (deep) sleep shows a weak link (r ≈ 0.09, <1% variance).
•   Awake time is essentially uninformative (r ≈ 0.006).
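A quick arithmetic check of the variance figures quoted above, reusing sleep_corr_df from the previous cell (squaring a univariate r is only a rough gauge of explained variance, not a causal estimate):

Quick check: univariate variance in HRV explained by each sleep stage
# r² per sleep stage, expressed as a percentage of HRV variance
check_df = sleep_corr_df.assign(pct_variance=lambda d: (d['Correlation'] ** 2 * 100).round(1))
print(check_df[['Sleep Stage', 'Correlation', 'pct_variance']])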

Key Insights from EDA

Generate summary of key findings
print("\n" + "="*80)
print("📊 EXPLORATORY DATA ANALYSIS SUMMARY")
print("="*80)

print(f"\n1️⃣  DATASET OVERVIEW")
print(f"   • Total observations: {len(df)} days")
print(f"   • Date range: {(df['cycle_date'].max() - df['cycle_date'].min()).days} days")
print(f"   • Missing data: {df.isnull().sum().sum()} total missing values")

print(f"\n2️⃣  HRV CHARACTERISTICS")
print(f"   • Mean HRV: {df['hrv_rmssd_ms'].mean():.1f} ms")
print(f"   • Coefficient of Variation: {(df['hrv_rmssd_ms'].std() / df['hrv_rmssd_ms'].mean() * 100):.1f}%")
print(f"   • Distribution: {'Normal' if stats.shapiro(df['hrv_rmssd_ms'].dropna())[1] > 0.05 else 'Non-normal'}")

print(f"\n3️⃣  TOP POSITIVE CORRELATIONS (Higher → Higher HRV)")
top_pos = correlations[correlations > 0].sort_values(ascending=False)[1:4]
for feature, corr in top_pos.items():
    print(f"   • {feature}: {corr:.3f}")

print(f"\n4️⃣  TOP NEGATIVE CORRELATIONS (Higher → Lower HRV)")
top_neg = correlations[correlations < 0].sort_values()[0:3]
for feature, corr in top_neg.items():
    print(f"   • {feature}: {corr:.3f}")

print(f"\n5️⃣  DATA QUALITY")
outlier_count = sum([len(df[col][(df[col] < df[col].quantile(0.25) - 1.5 * (df[col].quantile(0.75) - df[col].quantile(0.25))) | 
                                   (df[col] > df[col].quantile(0.75) + 1.5 * (df[col].quantile(0.75) - df[col].quantile(0.25)))]) 
                      for col in numeric_cols])
print(f"   • Total outliers detected: {outlier_count}")
print(f"   • Features with >10% missing: {len(missing_df[missing_df['Percentage'] > 10])}")

print("\n" + "="*80)
print("✅ Phase 1 Complete: Ready for Feature Engineering & Modeling")
print("="*80)

================================================================================
📊 EXPLORATORY DATA ANALYSIS SUMMARY
================================================================================

1️⃣  DATASET OVERVIEW
   • Total observations: 260 days
   • Date range: 267 days
   • Missing data: 5 total missing values

2️⃣  HRV CHARACTERISTICS
   • Mean HRV: 66.7 ms
   • Coefficient of Variation: 16.2%
   • Distribution: Non-normal

3️⃣  TOP POSITIVE CORRELATIONS (Higher → Higher HRV)
   • total_in_bed_time_ms: 0.276
   • total_rem_sleep_time_ms: 0.271
   • total_light_sleep_time_ms: 0.269

4️⃣  TOP NEGATIVE CORRELATIONS (Higher → Lower HRV)
   • resting_heart_rate_bpm: -0.648
   • respiratory_rate: -0.478
   • cycle_avg_hr_bpm: -0.265

5️⃣  DATA QUALITY
   • Total outliers detected: 291
   • Features with >10% missing: 0

================================================================================
✅ Phase 1 Complete: Ready for Feature Engineering & Modeling
================================================================================

Phase 2: Data Preprocessing & Feature Engineering

2.1 Missing Data Imputation

Critical First Step: Handle missing values BEFORE creating any derived features. Lagged and rolling window calculations will fail or produce incorrect results with NULL values.
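A minimal illustration of why imputation has to come first, using a toy series that stands in for a daily metric (the numbers are made up):

Toy example: a single NULL propagates into lag and rolling features
# A four-day toy series with one missing value
s = pd.Series([60.0, np.nan, 58.0, 55.0])

print(s.shift(1))                  # the NaN shifts into the next day's lag-1 feature
print(s.rolling(window=3).mean())  # every 3-day window containing the NaN evaluates to NaN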

Impute missing values with appropriate strategies
# Store original dataset for comparison
df_original = df.copy()

print("🔧 MISSING DATA IMPUTATION")
print("="*80)

# Check missing data before imputation
missing_before = df.isnull().sum()
print("\n📊 Missing values BEFORE imputation:")
print(missing_before[missing_before > 0])

# Imputation Strategy:
# - skin_temp_celsius: Use MEDIAN (robust to outliers)
# - Other numeric features: Use MEDIAN (safer than mean for skewed distributions)

numeric_cols_to_impute = df.select_dtypes(include=[np.number]).columns.tolist()

for col in numeric_cols_to_impute:
    if df[col].isnull().sum() > 0:
        median_value = df[col].median()
        missing_n = df[col].isna().sum()
        df[col] = df[col].fillna(median_value)
        print(f"✅ {col}: Filled {missing_n} missing values with median ({median_value:.2f})")

# Verify no missing values remain
missing_after = df.isnull().sum()
print(f"\n✅ Total missing values AFTER imputation: {missing_after.sum()}")

if missing_after.sum() == 0:
    print("🎉 All missing data successfully imputed!")
else:
    print("⚠️  Some missing values remain:")
    print(missing_after[missing_after > 0])
🔧 MISSING DATA IMPUTATION
================================================================================

📊 Missing values BEFORE imputation:
skin_temp_celsius    5
dtype: int64
✅ skin_temp_celsius: Filled 5 missing values with median (33.94)

✅ Total missing values AFTER imputation: 0
🎉 All missing data successfully imputed!

2.2 Feature Engineering (Time-Based Features)

Important: Engineer features from the ORIGINAL, untransformed data. This keeps features in human-interpretable units (minutes, bpm, etc.).

Create lagged features and rolling windows for time-series analysis
print("\n🔨 FEATURE ENGINEERING")
print("="*80)

# Store original dataset for comparison BEFORE feature engineering
df_original = df.copy()

# Sort by date to ensure correct time-series order
df = df.sort_values('cycle_date').reset_index(drop=True)
# Ensure cycle_date is true datetime for .dt operations (defensive)
from pandas.api.types import is_datetime64_any_dtype
if not is_datetime64_any_dtype(df['cycle_date']):
    df['cycle_date'] = pd.to_datetime(df['cycle_date'], errors='coerce')

# ============================================================================
# LAG FEATURES: Yesterday's values affecting today's HRV
# ============================================================================

lag_features = [
    'total_workout_minutes',
    'avg_workout_hr_bpm',
    'resting_heart_rate_bpm',
    'sleep_efficiency_percentage',
    'total_rem_sleep_time_ms',
    'total_slow_wave_sleep_time_ms'
]

print("\n📅 Creating LAG-1 Features (Previous Day):")
for feature in lag_features:
    lag_col_name = f"{feature}_lag1"
    df[lag_col_name] = df[feature].shift(1)
    print(f"   ✅ {lag_col_name}")

# ============================================================================
# ROLLING WINDOW FEATURES: Multi-day trends and patterns
# ============================================================================

rolling_features = [
    ('total_workout_minutes', 7, 'mean'),
    ('total_workout_minutes', 7, 'sum'),
    ('resting_heart_rate_bpm', 7, 'mean'),
    ('sleep_efficiency_percentage', 7, 'mean'),
    ('hrv_rmssd_ms', 7, 'std'),  # HRV variability over 7 days
]

print("\n📊 Creating ROLLING WINDOW Features:")
for feature, window, agg_func in rolling_features:
    col_name = f"{feature}_rolling{window}d_{agg_func}"
    
    if agg_func == 'mean':
        df[col_name] = df[feature].rolling(window=window, min_periods=1).mean()
    elif agg_func == 'sum':
        df[col_name] = df[feature].rolling(window=window, min_periods=1).sum()
    elif agg_func == 'std':
        df[col_name] = df[feature].rolling(window=window, min_periods=1).std()
    
    print(f"   ✅ {col_name}")

# ============================================================================
# CUMULATIVE TRAINING LOAD: Running total of workout time
# ============================================================================

print("\n💪 Creating CUMULATIVE TRAINING LOAD:")
df['cumulative_workout_minutes'] = df['total_workout_minutes'].cumsum()
print("   ✅ cumulative_workout_minutes")

# ============================================================================
# INTERACTION FEATURES: Combined effects
# ============================================================================

print("\n🔗 Creating INTERACTION Features:")

# Sleep quality score (normalized composite)
df['sleep_quality_score'] = (
    (df['sleep_efficiency_percentage'] / 100) * 0.4 +
    (df['total_rem_sleep_time_ms'] / df['total_rem_sleep_time_ms'].max()) * 0.3 +
    (df['total_slow_wave_sleep_time_ms'] / df['total_slow_wave_sleep_time_ms'].max()) * 0.3
)
print("   ✅ sleep_quality_score (composite)")


# Workout intensity index (HR zones weighted)
df['workout_intensity_index'] = (
    df['total_hr_zone_0_ms'] * 0.5 +
    df['total_hr_zone_1_ms'] * 1.0 +
    df['total_hr_zone_2_ms'] * 1.5 +
    df['total_hr_zone_3_ms'] * 2.0 +
    df['total_hr_zone_4_ms'] * 2.5 +
    df['total_hr_zone_5_ms'] * 3.0
) / 1000  # Convert ms to seconds
print("   ✅ workout_intensity_index (weighted HR zones)")

# ============================================================================
# ADVANCED PHYSIOLOGICAL FEATURES
# ============================================================================

print("\n🧠 Creating ADVANCED PHYSIOLOGICAL FEATURES:")

# 1. Exponentially Weighted Workout Intensity
print("   ➕ Exponentially Weighted Workout Intensity")
df['workout_intensity_exp'] = (
    df['total_hr_zone_1_ms'] * (1.2**1) +
    df['total_hr_zone_2_ms'] * (1.2**2) +
    df['total_hr_zone_3_ms'] * (1.2**3) +
    df['total_hr_zone_4_ms'] * (1.2**4) +
    df['total_hr_zone_5_ms'] * (1.2**5)
)

# 2. Proportional Sleep Stages
print("   ➕ Proportional Sleep Stage Features")
df['total_sleep_time_ms'] = df['total_in_bed_time_ms'] - df['total_awake_time_ms']
df['rem_sleep_pct'] = df['total_rem_sleep_time_ms'] / df['total_sleep_time_ms']
df['sws_sleep_pct'] = df['total_slow_wave_sleep_time_ms'] / df['total_sleep_time_ms']

# 3. Rate of Change Features: Deviation from Rolling Mean
print("   ➕ Deviation from Rolling Mean Features")
if 'resting_heart_rate_bpm_rolling7d_mean' in df.columns:
    df['rhr_deviation_from_7d_mean'] = df['resting_heart_rate_bpm'] - df['resting_heart_rate_bpm_rolling7d_mean']
else:
    df['rhr_deviation_from_7d_mean'] = np.nan

# 4. Strain vs Recovery Balance Score
print("   ➕ Strain vs Recovery Balance Features")
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
try:
    df[['sleep_quality_score_scaled', 'workout_intensity_exp_scaled']] = scaler.fit_transform(
        df[['sleep_quality_score', 'workout_intensity_exp']]
    )
    df['strain_recovery_balance'] = df['sleep_quality_score_scaled'] - df['workout_intensity_exp_scaled']
    df['strain_recovery_balance_lag1'] = df['strain_recovery_balance'].shift(1)
except Exception as e:
    print(f"⚠️  Could not compute strain_recovery_balance: {e}")

# 5. Day-of-Week & Weekend Flag
print("   ➕ Day-of-Week and Weekend Flags")
df['day_of_week'] = df['cycle_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

print("✅ Advanced physiological features successfully added.")

# Display summary of new features
new_features = [col for col in df.columns if col not in df_original.columns]
print(f"\n✅ Total new features created: {len(new_features)}")
print(f"📊 Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns (was {df_original.shape[1]})")

🔨 FEATURE ENGINEERING
================================================================================

📅 Creating LAG-1 Features (Previous Day):
   ✅ total_workout_minutes_lag1
   ✅ avg_workout_hr_bpm_lag1
   ✅ resting_heart_rate_bpm_lag1
   ✅ sleep_efficiency_percentage_lag1
   ✅ total_rem_sleep_time_ms_lag1
   ✅ total_slow_wave_sleep_time_ms_lag1

📊 Creating ROLLING WINDOW Features:
   ✅ total_workout_minutes_rolling7d_mean
   ✅ total_workout_minutes_rolling7d_sum
   ✅ resting_heart_rate_bpm_rolling7d_mean
   ✅ sleep_efficiency_percentage_rolling7d_mean
   ✅ hrv_rmssd_ms_rolling7d_std

💪 Creating CUMULATIVE TRAINING LOAD:
   ✅ cumulative_workout_minutes

🔗 Creating INTERACTION Features:
   ✅ sleep_quality_score (composite)
   ✅ workout_intensity_index (weighted HR zones)

🧠 Creating ADVANCED PHYSIOLOGICAL FEATURES:
   ➕ Exponentially Weighted Workout Intensity
   ➕ Proportional Sleep Stage Features
   ➕ Deviation from Rolling Mean Features
   ➕ Strain vs Recovery Balance Features
   ➕ Day-of-Week and Weekend Flags
✅ Advanced physiological features successfully added.

✅ Total new features created: 25
📊 Dataset shape: 260 rows × 49 columns (was 24)
Display summary of engineered features
print("\n📋 ENGINEERED FEATURES SUMMARY")
print("="*80)

feature_summary = []

# Lag features
lag_cols = [col for col in df.columns if '_lag1' in col]
feature_summary.append({'Category': 'Lag Features (Day -1)', 'Count': len(lag_cols), 'Examples': ', '.join(lag_cols[:3])})

# Rolling features
rolling_cols = [col for col in df.columns if 'rolling' in col]
feature_summary.append({'Category': 'Rolling Window (7-day)', 'Count': len(rolling_cols), 'Examples': ', '.join(rolling_cols[:3])})

# Cumulative features
cumulative_cols = [col for col in df.columns if 'cumulative' in col]
feature_summary.append({'Category': 'Cumulative Features', 'Count': len(cumulative_cols), 'Examples': ', '.join(cumulative_cols)})

# Interaction features
interaction_cols = ['sleep_quality_score', 'workout_intensity_index']
feature_summary.append({'Category': 'Interaction Features', 'Count': len(interaction_cols), 'Examples': ', '.join(interaction_cols)})

feature_summary_df = pd.DataFrame(feature_summary)
display(feature_summary_df)

📋 ENGINEERED FEATURES SUMMARY
================================================================================
Category Count Examples
0 Lag Features (Day -1) 7 total_workout_minutes_lag1, avg_workout_hr_bpm...
1 Rolling Window (7-day) 5 total_workout_minutes_rolling7d_mean, total_wo...
2 Cumulative Features 1 cumulative_workout_minutes
3 Interaction Features 2 sleep_quality_score, workout_intensity_index

2.3 Train-Test Split (Time-Series Aware)

Critical: Split data BEFORE any transformations that learn from the data (scaling, normalization). Use chronological split to prevent data leakage.

Chronological split for time-series validation
print("\n✂️  TRAIN-TEST SPLIT (Chronological)")
print("="*80)

# Remove rows with NaN values created by lagging (first row will have NaN lags)
# Also ensure cycle_date is datetime for .dt operations
df['cycle_date'] = pd.to_datetime(df['cycle_date'], errors='coerce')
df_clean = df.dropna()

# Calculate 80/20 split point
split_index = int(len(df_clean) * 0.8)
# Ensure we have a Timestamp for plotting and a date for filtering
split_date_ts = pd.to_datetime(df_clean.iloc[split_index]['cycle_date'])
split_date_date = split_date_ts.date()

# Ensure cycle_date is datetime and compare on .dt.date to avoid dtype issues
train_df = df_clean[df_clean['cycle_date'].dt.date < split_date_date].copy()
test_df = df_clean[df_clean['cycle_date'].dt.date >= split_date_date].copy()

print(f"📊 DATASET SPLIT:")
print(f"   Training Set:   {len(train_df)} days ({len(train_df)/len(df_clean)*100:.1f}%)")
print(f"   Test Set:       {len(test_df)} days ({len(test_df)/len(df_clean)*100:.1f}%)")
print(f"   Split Date:     {split_date_ts}")
print(f"   Train Range:    {train_df['cycle_date'].min()}{train_df['cycle_date'].max()}")
print(f"   Test Range:     {test_df['cycle_date'].min()}{test_df['cycle_date'].max()}")

# Verify no data leakage
if train_df['cycle_date'].max() >= test_df['cycle_date'].min():
    print("⚠️  WARNING: Potential data leakage detected!")
else:
    print("✅ No data leakage: Training data ends before test data begins")

# Visualize split
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=train_df['cycle_date'],
    y=train_df['hrv_rmssd_ms'],
    mode='markers',
    name='Training Data',
    marker=dict(color='#00D9FF', size=6, opacity=0.7),
))

fig.add_trace(go.Scatter(
    x=test_df['cycle_date'],
    y=test_df['hrv_rmssd_ms'],
    mode='markers',
    name='Test Data',
    marker=dict(color='#FF6B6B', size=6, opacity=0.7),
))

fig.add_vline(
    x=split_date_ts,
    line_dash="dash",
    line_color="yellow",
)

fig.add_annotation(
    x=split_date_ts, y=1, yref="paper",
    text="Train/Test Split",
    showarrow=False, yanchor="bottom"
)

fig.update_layout(
    title="Train-Test Split: Chronological Separation",
    xaxis_title="Date",
    yaxis_title="HRV (ms)",
    template="plotly_dark",
    height=500
)

fig.show()

✂️  TRAIN-TEST SPLIT (Chronological)
================================================================================
📊 DATASET SPLIT:
   Training Set:   207 days (79.9%)
   Test Set:       52 days (20.1%)
   Split Date:     2025-08-18 00:00:00
   Train Range:    2025-01-15 00:00:00 → 2025-08-17 00:00:00
   Test Range:     2025-08-18 00:00:00 → 2025-10-08 00:00:00
✅ No data leakage: Training data ends before test data begins

🧩 Interpretation: Train–Test Split

  • Purpose: The chart shows a chronological split of HRV data into training (blue) and test (red).
  • What It Means: The yellow dashed line is the cutoff; left = train, right = test. This prevents temporal leakage and mirrors real forecasting.
  • Observations: Test-period HRV is slightly higher/less variable (recent upward trend), but ranges overlap (~40–100 ms), so distribution shift is modest.
  • Why It Matters: Validates that evaluation is truly out-of-sample in time (a multi-fold extension of the same idea is sketched below).
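The same chronological principle can be extended from a single holdout to several expanding-window folds. The sketch below uses scikit-learn's TimeSeriesSplit purely to print the fold boundaries on df_clean; it is an optional complement to the 80/20 split above, and the choice of 4 folds is arbitrary.

Sketch: expanding-window folds with TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=4)  # number of folds is an arbitrary choice
dates = df_clean.sort_values('cycle_date')['cycle_date'].reset_index(drop=True)

for fold, (train_idx, test_idx) in enumerate(tscv.split(dates), start=1):
    # Every fold trains only on days that precede its test window, so no future data leaks in
    print(f"Fold {fold}: train {dates[train_idx[0]].date()} → {dates[train_idx[-1]].date()} | "
          f"test {dates[test_idx[0]].date()} → {dates[test_idx[-1]].date()}")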
Final verification of preprocessing pipeline
print("\n✅ PREPROCESSING PIPELINE COMPLETE")
print("="*80)

print(f"\n1️⃣  Missing Data:")
print(f"   • Original missing values: {df_original.isnull().sum().sum()}")
print(f"   • After imputation: {df.isnull().sum().sum()}")

print(f"\n2️⃣  Feature Engineering:")
print(f"   • Original features: {df_original.shape[1]}")
print(f"   • Engineered features: {df.shape[1] - df_original.shape[1]}")
print(f"   • Total features: {df.shape[1]}")

print(f"\n3️⃣  Data Split:")
print(f"   • Training samples: {len(train_df)}")
print(f"   • Test samples: {len(test_df)}")
print(f"   • No data leakage: ✅")

print(f"\n📈 Ready for modeling with {df.shape[1]} features!")

✅ PREPROCESSING PIPELINE COMPLETE
================================================================================

1️⃣  Missing Data:
   • Original missing values: 0
   • After imputation: 8

2️⃣  Feature Engineering:
   • Original features: 24
   • Engineered features: 25
   • Total features: 49

3️⃣  Data Split:
   • Training samples: 207
   • Test samples: 52
   • No data leakage: ✅

📈 Ready for modeling with 49 features!

2.4 Data Transformation & Scaling

Strategy: Create TWO separate datasets optimized for different model families:

  1. Tree-based models (XGBoost, Random Forest): Use raw data (no transformations needed)
  2. Linear models (Lasso, Ridge): Apply log transformations to highly skewed features

Prepare datasets for tree-based and linear models
print("\n🔀 DATA TRANSFORMATION STRATEGY")
print("="*80)

# ============================================================================
# DATASET 1: For Tree-Based Models (XGBoost, Random Forest)
# ============================================================================

print("\n🌳 DATASET 1: Tree-Based Models (No Transformation)")
print("-" * 80)

# Tree models are robust to skewed distributions - use raw features
train_tree = train_df.copy()
test_tree = test_df.copy()

print(f"✅ Training set: {train_tree.shape}")
print(f"✅ Test set:     {test_tree.shape}")
print("📌 Strategy: Use raw features (tree models handle skewness naturally)")

# ============================================================================
# DATASET 2: For Linear Models (Lasso, Ridge, ElasticNet)
# ============================================================================

print("\n📈 DATASET 2: Linear Models (Log Transformation)")
print("-" * 80)

# Identify highly skewed features for transformation
skewed_features = [
    'total_workout_minutes',
    'total_hr_zone_0_ms',
    'total_hr_zone_1_ms',
    'total_hr_zone_2_ms',
    'total_hr_zone_3_ms',
    'total_hr_zone_4_ms',
    'total_hr_zone_5_ms',
    'workout_intensity_index',
    'total_workout_minutes_rolling7d_sum',
    'cumulative_workout_minutes'
]

# Create copies for linear models
train_linear = train_df.copy()
test_linear = test_df.copy()

print("\n🔢 Applying log(x + 1) transformation to skewed features:")
for feature in skewed_features:
    if feature in train_linear.columns:
        # Apply log(x + 1) transformation
        train_linear[f'{feature}_log'] = np.log1p(train_linear[feature])
        test_linear[f'{feature}_log'] = np.log1p(test_linear[feature])
        
        # Drop original feature (keep only transformed version)
        train_linear = train_linear.drop(columns=[feature])
        test_linear = test_linear.drop(columns=[feature])
        
        print(f"   ✅ {feature} → {feature}_log")

print(f"\n✅ Training set: {train_linear.shape}")
print(f"✅ Test set:     {test_linear.shape}")
print("📌 Strategy: Log-transform skewed workout features for linear assumptions")

# ============================================================================
# Visualize Transformation Effect
# ============================================================================

print("\n📊 Visualizing transformation effect on key features...")

🔀 DATA TRANSFORMATION STRATEGY
================================================================================

🌳 DATASET 1: Tree-Based Models (No Transformation)
--------------------------------------------------------------------------------
✅ Training set: (207, 49)
✅ Test set:     (52, 49)
📌 Strategy: Use raw features (tree models handle skewness naturally)

📈 DATASET 2: Linear Models (Log Transformation)
--------------------------------------------------------------------------------

🔢 Applying log(x + 1) transformation to skewed features:
   ✅ total_workout_minutes → total_workout_minutes_log
   ✅ total_hr_zone_0_ms → total_hr_zone_0_ms_log
   ✅ total_hr_zone_1_ms → total_hr_zone_1_ms_log
   ✅ total_hr_zone_2_ms → total_hr_zone_2_ms_log
   ✅ total_hr_zone_3_ms → total_hr_zone_3_ms_log
   ✅ total_hr_zone_4_ms → total_hr_zone_4_ms_log
   ✅ total_hr_zone_5_ms → total_hr_zone_5_ms_log
   ✅ workout_intensity_index → workout_intensity_index_log
   ✅ total_workout_minutes_rolling7d_sum → total_workout_minutes_rolling7d_sum_log
   ✅ cumulative_workout_minutes → cumulative_workout_minutes_log

✅ Training set: (207, 49)
✅ Test set:     (52, 49)
📌 Strategy: Log-transform skewed workout features for linear assumptions

📊 Visualizing transformation effect on key features...
Compare distributions before/after log transformation
# Select 4 key features to visualize
features_to_visualize = [
    'total_workout_minutes',
    'total_hr_zone_3_ms',
    'workout_intensity_index',
    'cumulative_workout_minutes'
]

fig, axes = plt.subplots(4, 2, figsize=(10, 11))

for idx, feature in enumerate(features_to_visualize):
    if feature not in train_df.columns:
        continue
    
    # Left column: Original distribution
    ax_orig = axes[idx, 0]
    data_orig = train_df[feature].dropna()
    ax_orig.hist(data_orig, bins=40, alpha=0.7, color='#FF6B6B', edgecolor='white')
    ax_orig.set_title(f'Original: {feature}', fontsize=11, fontweight='bold')
    ax_orig.set_xlabel('Value', fontsize=9)
    ax_orig.set_ylabel('Frequency', fontsize=9)
    ax_orig.grid(alpha=0.3)
    
    # Add skewness annotation
    skewness = data_orig.skew()
    ax_orig.text(0.95, 0.95, f'Skewness: {skewness:.2f}', 
                 transform=ax_orig.transAxes, ha='right', va='top',
                 bbox=dict(boxstyle='round', facecolor='black', alpha=0.5),
                 fontsize=9, color='white')
    
    # Right column: Log-transformed distribution
    ax_log = axes[idx, 1]
    data_log = np.log1p(data_orig)
    ax_log.hist(data_log, bins=40, alpha=0.7, color='#00D9FF', edgecolor='white')
    ax_log.set_title(f'Log-Transformed: log({feature} + 1)', fontsize=11, fontweight='bold')
    ax_log.set_xlabel('Log Value', fontsize=9)
    ax_log.set_ylabel('Frequency', fontsize=9)
    ax_log.grid(alpha=0.3)
    
    # Add skewness annotation
    skewness_log = data_log.skew()
    ax_log.text(0.95, 0.95, f'Skewness: {skewness_log:.2f}', 
                transform=ax_log.transAxes, ha='right', va='top',
                bbox=dict(boxstyle='round', facecolor='black', alpha=0.5),
                fontsize=9, color='white')

plt.suptitle('Impact of Log Transformation on Skewed Features', 
             fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print("\n💡 TRANSFORMATION INSIGHTS:")
print("   • Original features show heavy right-skew (long tail of high values)")
print("   • Log transformation normalizes distributions for linear model assumptions")
print("   • Tree models don't need this - they handle skewness naturally")

Effect of log transformation on skewed features

💡 TRANSFORMATION INSIGHTS:
   • Original features show heavy right-skew (long tail of high values)
   • Log transformation normalizes distributions for linear model assumptions
   • Tree models don't need this - they handle skewness naturally
Standardize features for linear models
from sklearn.preprocessing import StandardScaler

print("\n📏 FEATURE SCALING (Linear Models Only)")
print("="*80)

# Separate features and target
feature_cols_linear = [col for col in train_linear.columns if col not in ['cycle_date', 'hrv_rmssd_ms']]

X_train_linear = train_linear[feature_cols_linear]
y_train_linear = train_linear['hrv_rmssd_ms']
X_test_linear = test_linear[feature_cols_linear]
y_test_linear = test_linear['hrv_rmssd_ms']

# Fit scaler on training data ONLY (prevent data leakage)
scaler = StandardScaler()
X_train_linear_scaled = scaler.fit_transform(X_train_linear)
X_test_linear_scaled = scaler.transform(X_test_linear)

# Convert back to DataFrame for easier handling
X_train_linear_scaled = pd.DataFrame(
    X_train_linear_scaled, 
    columns=feature_cols_linear,
    index=X_train_linear.index
)
X_test_linear_scaled = pd.DataFrame(
    X_test_linear_scaled, 
    columns=feature_cols_linear,
    index=X_test_linear.index
)

print(f"✅ Scaled {len(feature_cols_linear)} features using StandardScaler")
print(f"   • Mean: {X_train_linear_scaled.mean().mean():.6f} (target: 0)")
print(f"   • Std:  {X_train_linear_scaled.std().mean():.6f} (target: 1)")

# For tree models, prepare feature matrices (no scaling needed)
feature_cols_tree = [col for col in train_tree.columns if col not in ['cycle_date', 'hrv_rmssd_ms']]

X_train_tree = train_tree[feature_cols_tree]
y_train_tree = train_tree['hrv_rmssd_ms']
X_test_tree = test_tree[feature_cols_tree]
y_test_tree = test_tree['hrv_rmssd_ms']

print(f"\n🌳 Tree model features prepared (no scaling): {len(feature_cols_tree)} features")

📏 FEATURE SCALING (Linear Models Only)
================================================================================
✅ Scaled 47 features using StandardScaler
   • Mean: -0.000000 (target: 0)
   • Std:  1.002424 (target: 1)

🌳 Tree model features prepared (no scaling): 47 features
Summary of prepared datasets
print("\n📦 FINAL DATASET SUMMARY")
print("="*80)

print("\n🌳 TREE-BASED MODELS (XGBoost, Random Forest)")
print("-" * 80)
print(f"   Training features: {X_train_tree.shape}")
print(f"   Test features:     {X_test_tree.shape}")
print(f"   Training target:   {y_train_tree.shape}")
print(f"   Test target:       {y_test_tree.shape}")
print(f"   Transformations:   None (raw features)")
print(f"   Scaling:           None (tree models don't need it)")

print("\n📈 LINEAR MODELS (Lasso, Ridge, ElasticNet)")
print("-" * 80)
print(f"   Training features: {X_train_linear_scaled.shape}")
print(f"   Test features:     {X_test_linear_scaled.shape}")
print(f"   Training target:   {y_train_linear.shape}")
print(f"   Test target:       {y_test_linear.shape}")
print(f"   Transformations:   log(x+1) on {len(skewed_features)} skewed features")
print(f"   Scaling:           StandardScaler (mean=0, std=1)")

print("\n✅ Both datasets ready for modeling!")
print("="*80)

📦 FINAL DATASET SUMMARY
================================================================================

🌳 TREE-BASED MODELS (XGBoost, Random Forest)
--------------------------------------------------------------------------------
   Training features: (207, 47)
   Test features:     (52, 47)
   Training target:   (207,)
   Test target:       (52,)
   Transformations:   None (raw features)
   Scaling:           None (tree models don't need it)

📈 LINEAR MODELS (Lasso, Ridge, ElasticNet)
--------------------------------------------------------------------------------
   Training features: (207, 47)
   Test features:     (52, 47)
   Training target:   (207,)
   Test target:       (52,)
   Transformations:   log(x+1) on 10 skewed features
   Scaling:           StandardScaler (mean=0, std=1)

✅ Both datasets ready for modeling!
================================================================================
Document key preprocessing decisions
print("\n📋 KEY PREPROCESSING DECISIONS")
print("="*80)

print("\n1️⃣  Missing Data Handling")
print("   • Strategy: Median imputation (5 missing values in skin_temp_celsius)")
print("   • Rationale: Median is robust to outliers, preserves distribution shape")

print("\n2️⃣  Outlier Treatment")
print("   ❌ NO REMOVAL: Workout outliers are legitimate high-intensity training days")
print("   • Rationale: These are SIGNAL, not noise - critical for understanding HRV drivers")
print("   • Spike days represent key training events we want to learn from")

print("\n3️⃣  Feature Engineering")
print("   • Lag features (n=6): Capture previous-day effects")
print("   • Rolling windows (n=5): Smooth spiky workout data into cumulative load")
print("   • Interaction terms (n=2): Sleep quality score, workout intensity index")
print("   • Total new features: {0}".format(len([col for col in df.columns if col not in df_original.columns])))

print("\n4️⃣  Data Split")
print("   • Method: Chronological 80/20 split")
print("   • Rationale: Prevents temporal data leakage in time-series data")
print("   • Split date: {0}".format(split_date_ts))

print("\n5️⃣  Transformation Strategy")
print("   • Tree models: Raw features (naturally handle skewness)")
print("   • Linear models: Log-transform skewed workout features")
print("   • Rationale: Match transformation to model assumptions")

print("\n6️⃣  Scaling Strategy")
print("   • Tree models: No scaling (tree splits are scale-invariant)")
print("   • Linear models: StandardScaler (equal feature importance)")
print("   • Fit on training set ONLY (prevent data leakage)")

print("\n" + "="*80)
print("🎯 Ready to proceed to Phase 3: Statistical Modeling!")
print("="*80)

📋 KEY PREPROCESSING DECISIONS
================================================================================

1️⃣  Missing Data Handling
   • Strategy: Median imputation (5 missing values in skin_temp_celsius)
   • Rationale: Median is robust to outliers, preserves distribution shape

2️⃣  Outlier Treatment
   ❌ NO REMOVAL: Workout outliers are legitimate high-intensity training days
   • Rationale: These are SIGNAL, not noise - critical for understanding HRV drivers
   • Spike days represent key training events we want to learn from

3️⃣  Feature Engineering
   • Lag features (n=6): Capture previous-day effects
   • Rolling windows (n=5): Smooth spiky workout data into cumulative load
   • Interaction terms (n=2): Sleep quality score, workout intensity index
   • Total new features: 25

4️⃣  Data Split
   • Method: Chronological 80/20 split
   • Rationale: Prevents temporal data leakage in time-series data
   • Split date: 2025-08-18 00:00:00

5️⃣  Transformation Strategy
   • Tree models: Raw features (naturally handle skewness)
   • Linear models: Log-transform skewed workout features
   • Rationale: Match transformation to model assumptions

6️⃣  Scaling Strategy
   • Tree models: No scaling (tree splits are scale-invariant)
   • Linear models: StandardScaler (equal feature importance)
   • Fit on training set ONLY (prevent data leakage)

================================================================================
🎯 Ready to proceed to Phase 3: Statistical Modeling!
================================================================================
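
For reference, the chronological split summarized in item 4 above amounts to sorting the daily table by date and cutting at the 80% mark. Below is a minimal sketch; the real split was performed in Phase 2, and the _sketch names are illustrative so they don't clobber the existing train/test objects.

# Sketch: chronological 80/20 split on the daily feature table (no shuffling, so the
# test window is strictly later in time than the training window)
df_sorted = df.sort_values('cycle_date').reset_index(drop=True)
split_idx = int(len(df_sorted) * 0.8)
train_df_sketch = df_sorted.iloc[:split_idx]
test_df_sketch = df_sorted.iloc[split_idx:]
print(f"Train: {len(train_df_sketch)} days | Test: {len(test_df_sketch)} days | first test day: {test_df_sketch['cycle_date'].iloc[0]}")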

Phase 3: Statistical Modeling & Inference

Goal: Train predictive models to identify the most significant drivers of HRV. We will use two distinct approaches:

  1. Lasso Regression: A linear model that is excellent for inference due to its interpretable coefficients and built-in feature selection.
  2. XGBoost: A powerful gradient boosting model that can capture complex, non-linear relationships, which we will interpret using SHAP values.
Import modeling libraries and define evaluation metrics
# Core modeling libraries
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
import shap

# Helper for displaying results
from IPython.display import display, Markdown

print("✅ Modeling libraries loaded successfully")
✅ Modeling libraries loaded successfully

3.1 Model Training

We will now train both models on their respective prepared datasets.

  • Lasso will use the log-transformed and scaled data (_linear_scaled).
  • XGBoost will use the raw, engineered data (_tree).
Train Lasso and XGBoost models
# --- 1. Lasso Regression ---
print("🏃 Training Lasso model...")
# Alpha is a hyperparameter that controls regularization. Higher alpha = simpler model.
# 0.1 is a reasonable starting point for standardized data.
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X_train_linear_scaled, y_train_linear)
print("✅ Lasso model trained.")

# --- 2. XGBoost Regressor ---
print("\n🏃 Training XGBoost model...")
# Using early stopping to prevent overfitting
xgb_model = xgb.XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    early_stopping_rounds=50,
    random_state=42,
    objective='reg:squarederror'
)

# We use the test set as an evaluation set for early stopping
xgb_model.fit(
    X_train_tree, y_train_tree,
    eval_set=[(X_test_tree, y_test_tree)],
    verbose=False
)
print(f"✅ XGBoost model trained. Best iteration: {xgb_model.best_iteration}")
🏃 Training Lasso model...
✅ Lasso model trained.

🏃 Training XGBoost model...
✅ XGBoost model trained. Best iteration: 275
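
The alpha=0.1 above was fixed by hand as a sensible default for standardized features. If we wanted a data-driven choice instead, a cross-validated search over alphas with time-ordered folds would be the natural extension; a minimal sketch, not part of the original run, so the selected value is illustrative:

# Sketch: choose Lasso's alpha by cross-validation with time-ordered folds
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 50), cv=TimeSeriesSplit(n_splits=5), random_state=42)
lasso_cv.fit(X_train_linear_scaled, y_train_linear)
print(f"Selected alpha: {lasso_cv.alpha_:.4f}")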

3.2 Model Evaluation

Let’s evaluate how well each model performed on both the data it was trained on and the unseen test data. The key is to compare the Test RMSE (Root Mean Squared Error), which tells us the typical error in our prediction, in milliseconds of HRV.

Evaluate model performance with RMSE and R-squared
models = {
    "Lasso": {"model": lasso, "X_train": X_train_linear_scaled, "y_train": y_train_linear, "X_test": X_test_linear_scaled, "y_test": y_test_linear},
}
if 'best_tree_model' in globals():
    models["TunedTree"] = {"model": best_tree_model, "X_train": X_train_tree, "y_train": y_train_tree, "X_test": X_test_tree, "y_test": y_test_tree}
elif 'xgb_model' in globals():
    models["XGBoost"] = {"model": xgb_model, "X_train": X_train_tree, "y_train": y_train_tree, "X_test": X_test_tree, "y_test": y_test_tree}

🧩 Interpretation: Model Performance

  • RMSE: The Test RMSE of ~7.6 ms for the tuned tree model is our key metric. It means, on average, our predictions for next-day HRV are off by about 7.6 milliseconds. Given the natural daily HRV fluctuation (Std Dev ≈ 10.8 ms), this is a strong result.
  • R²: The model explains roughly 50% of the variance in next-day HRV on unseen data. This is excellent for complex physiological data, indicating our features capture a substantial amount of the signal.

To understand where the model succeeds and fails, we can visualize its predictions against the actual values over time.

Plot actual HRV vs. model predictions on the test set
# Add predictions to the test dataframe for easy plotting
test_results_df = test_tree.copy()  # test_tree already carries the 'cycle_date' column
test_results_df['hrv_actual'] = y_test_tree.values
# Use the tuned tree model if available; otherwise fall back to the baseline XGBoost model
_plot_model = best_tree_model if 'best_tree_model' in globals() else xgb_model
test_results_df['hrv_predicted'] = _plot_model.predict(X_test_tree)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=test_results_df['cycle_date'], 
    y=test_results_df['hrv_actual'], 
    mode='lines+markers', 
    name='Actual HRV', 
    line=dict(color='#00D9FF')
))
fig.add_trace(go.Scatter(
    x=test_results_df['cycle_date'], 
    y=test_results_df['hrv_predicted'], 
    mode='lines', 
    name='Predicted HRV', 
    line=dict(color='#FF6B6B', dash='dash')
))
fig.update_layout(
    title='Actual vs. Predicted HRV on Test Set', 
    template='plotly_dark', 
    height=500,
    legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
)
fig.show()

🧩 Interpretation: Actual vs. Predicted

  • The model successfully captures the general trend and magnitude of HRV fluctuations.
  • It correctly predicts the direction of most major peaks and troughs, though it tends to slightly underestimate the most extreme values (a common trait of regression models, which shrink predictions toward the mean).

3.3 Inference: What Drives Your HRV?

This is the core of our project. We will now interpret what each model has learned.

3.3.1 Lasso: Linear Drivers

The coefficients of the Lasso model tell us the linear relationship between each feature and HRV. Because Lasso pushes unimportant features to zero, we are left with a sparse, interpretable set of drivers.

Interpret Lasso model coefficients to find key linear drivers
# Extract coefficients
lasso_coeffs = pd.DataFrame({
    'Feature': X_train_linear_scaled.columns,
    'Coefficient': lasso.coef_
})

# Filter out zero-coefficient features
lasso_coeffs = lasso_coeffs[lasso_coeffs['Coefficient'] != 0].sort_values('Coefficient', ascending=False)

# Plot coefficients
fig, ax = plt.subplots(figsize=(8, 8))
colors = ['#00D9FF' if x > 0 else '#FF6B6B' for x in lasso_coeffs['Coefficient']]
bars = ax.barh(lasso_coeffs['Feature'], lasso_coeffs['Coefficient'], color=colors)

ax.set_title('Lasso Coefficients: Key Drivers of HRV', fontsize=16, fontweight='bold')
ax.set_xlabel('Coefficient (Impact on HRV)', fontsize=12)
ax.axvline(0, color='white', linewidth=0.5)
ax.grid(axis='x', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

display(Markdown("### Top 5 Positive Drivers (Lasso)"))
display(lasso_coeffs.head(5))

display(Markdown("### Top 5 Negative Drivers (Lasso)"))
display(lasso_coeffs.tail(5))

print("\n💡 INTERPRETATION:")
print(f"   • Positive coefficients increase HRV (good for recovery).")
print(f"   • Negative coefficients decrease HRV (bad for recovery).")
print(f"   • These show the impact of a one-standard-deviation change in the feature on HRV (in ms).")

Top 5 Positive Drivers (Lasso)

Feature                        Coefficient
avg_workout_hr_bpm                0.274607
total_light_sleep_time_ms         0.187877
total_rem_sleep_time_ms           0.136881
sws_sleep_pct                    -0.754423
resting_heart_rate_bpm           -2.013322

Top 5 Negative Drivers (Lasso)

Feature                        Coefficient
total_light_sleep_time_ms         0.187877
total_rem_sleep_time_ms           0.136881
sws_sleep_pct                    -0.754423
resting_heart_rate_bpm           -2.013322
rhr_deviation_from_7d_mean       -3.267066

Note: Lasso retained only six non-zero coefficients, so the head(5)/tail(5) views above overlap and each contains both signs. The genuine positive drivers are avg_workout_hr_bpm, total_light_sleep_time_ms, and total_rem_sleep_time_ms; the genuine negative drivers are sws_sleep_pct, resting_heart_rate_bpm, and rhr_deviation_from_7d_mean.

💡 INTERPRETATION:
   • Positive coefficients increase HRV (good for recovery).
   • Negative coefficients decrease HRV (bad for recovery).
   • These show the impact of a one-standard-deviation change in the feature on HRV (in ms).
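
Because these coefficients are in standard-deviation units, a coefficient can be translated back into raw units by dividing by that feature's training-set standard deviation, which the fitted StandardScaler stores in scaler.scale_. A minimal sketch; note that for the ten log-transformed workout features the result is per unit of log(x + 1), not per raw minute:

# Sketch: convert standardized Lasso coefficients to per-unit effects on HRV (ms)
feature_std = pd.Series(scaler.scale_, index=feature_cols_linear)
per_unit = lasso_coeffs.set_index('Feature')['Coefficient'] / feature_std[lasso_coeffs['Feature']].values
print(per_unit.round(3))
# e.g., dividing the resting_heart_rate_bpm coefficient by its standard deviation (in bpm)
# gives the expected change in HRV (ms) per additional bpm of resting heart rate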

3.3.2 XGBoost: Non-Linear Drivers (SHAP Analysis)

XGBoost is a “black box” model, so we use SHAP (SHapley Additive exPlanations) to understand its predictions. The summary plot shows the most important features and the direction of their impact.

Model interpretability: SHAP (or permutation importance fallback)
print("⚙️  Calculating feature attributions...")

# Helper to ensure we have a model to interpret
_tree_model = None
if 'best_tree_model' in globals():
    _tree_model = best_tree_model
elif 'xgb_model' in globals():
    _tree_model = xgb_model

if _tree_model is None:
    print("⏭️ No tree model available to interpret.")
else:
    # Try SHAP first
    _has_shap = 'shap' in globals()
    if _has_shap:
        try:
            # Case A: XGBoost model → use TreeExplainer
            if 'xgb' in globals() and hasattr(xgb, 'XGBRegressor') and isinstance(_tree_model, xgb.XGBRegressor):
                print("🧭 Using SHAP TreeExplainer for XGBoost...")
                explainer = shap.TreeExplainer(_tree_model)
                shap_values = explainer.shap_values(X_test_tree)

                # Bar plot (mean |SHAP|)
                shap.summary_plot(
                    shap_values,
                    X_test_tree,
                    plot_type="bar",
                    show=False
                )
                plt.title('Feature Importance (SHAP, XGBoost)', fontsize=16, fontweight='bold')
                plt.tight_layout()
                plt.show()

                # Beeswarm
                shap.summary_plot(
                    shap_values,
                    X_test_tree,
                    show=False
                )
                plt.title('SHAP Summary (XGBoost)', fontsize=16, fontweight='bold')
                plt.tight_layout()
                plt.show()

            else:
                # Case B: scikit-learn tree model (e.g., HistGradientBoosting)
                print("🧭 Using SHAP generic Explainer for scikit-learn tree model...")
                explainer = shap.Explainer(_tree_model, X_train_tree)
                shap_values = explainer(X_test_tree)

                # Bar plot (mean |SHAP|)
                shap.plots.bar(shap_values, max_display=20, show=False)
                plt.title('Feature Importance (SHAP, TunedTree)', fontsize=16, fontweight='bold')
                plt.tight_layout()
                plt.show()

                # Beeswarm
                shap.plots.beeswarm(shap_values, max_display=20, show=False)
                plt.title('SHAP Summary (TunedTree)', fontsize=16, fontweight='bold')
                plt.tight_layout()
                plt.show()

        except Exception as e:
            print(f"⚠️ SHAP failed: {e}")
            _has_shap = False

    if not _has_shap:
        # Fallback: permutation importance
        print("⏭️ SHAP unavailable → using permutation importance fallback...")
        from sklearn.inspection import permutation_importance

        r = permutation_importance(_tree_model, X_test_tree, y_test_tree, n_repeats=20, random_state=42, n_jobs=-1)
        importances = pd.Series(r.importances_mean, index=X_test_tree.columns).sort_values(ascending=True)

        fig, ax = plt.subplots(figsize=(8, 8))
        ax.barh(importances.index[-25:], importances.values[-25:], color='#00D9FF', alpha=0.85)
        ax.set_title('Permutation Importance (Top 25 Features)', fontsize=16, fontweight='bold')
        ax.set_xlabel('Mean importance (decrease in R² when shuffled)', fontsize=12)
        ax.grid(axis='x', alpha=0.3)
        plt.tight_layout()
        plt.show()

        print("💡 Interpretation: Higher bars = features whose shuffling degrades the score the most (more important). Permutation importance does not indicate direction of effect; use partial dependence plots if direction is needed.")
⚙️  Calculating feature attributions...
🧭 Using SHAP generic Explainer for scikit-learn tree model...
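
As a quick sanity check on these attributions, SHAP values are additive: the explainer's base value plus a row's SHAP contributions should reconstruct the model's prediction for that row. A minimal check that covers both code paths above (the row index chosen here is arbitrary):

# Sketch: verify SHAP additivity for one test-set row (base value + contributions ≈ prediction)
row = 0
if isinstance(shap_values, np.ndarray):              # XGBoost TreeExplainer path
    base = explainer.expected_value
    contrib = shap_values[row].sum()
else:                                                # generic Explainer path (Explanation object)
    base = shap_values.base_values[row]
    contrib = shap_values.values[row].sum()
pred = _tree_model.predict(X_test_tree.iloc[[row]])[0]
print(f"Base {base:.2f} + contributions {contrib:.2f} ≈ prediction {pred:.2f} ms")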

To showcase the model’s explanatory power on a granular level, we can generate SHAP waterfall plots for specific days in the test set. Let’s analyze the days with the lowest and highest HRV to see what drove those outcomes.

Explain individual predictions with SHAP waterfall plots
# Only run if we have SHAP and a tree model
if 'explainer' in globals() and 'shap_values' in globals():
    display(Markdown("### Deconstructing a Low HRV Day"))

    # Find the day with the lowest HRV in the test set
    worst_day_idx = y_test_tree.idxmin()
    
    # For XGBoost models, shap_values is a numpy array
    if isinstance(shap_values, np.ndarray):
        test_idx_position = X_test_tree.index.get_loc(worst_day_idx)
        worst_day_shap_values = shap_values[test_idx_position]
        
        print(f"Analyzing the lowest HRV day: {test_tree.loc[worst_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[worst_day_idx]:.1f} ms)")
        
        shap.waterfall_plot(shap.Explanation(
            values=worst_day_shap_values,
            base_values=explainer.expected_value,
            data=X_test_tree.loc[worst_day_idx].values,
            feature_names=X_test_tree.columns.tolist()
        ))
    else:
        # For sklearn models, shap_values is an Explanation object
        worst_day_shap = shap_values[X_test_tree.index.get_loc(worst_day_idx)]
        
        print(f"Analyzing the lowest HRV day: {test_tree.loc[worst_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[worst_day_idx]:.1f} ms)")
        
        shap.plots.waterfall(worst_day_shap, max_display=15, show=False)
        plt.tight_layout()
        plt.show()

    display(Markdown("### Deconstructing a High HRV Day"))

    # Find the day with the highest HRV in the test set
    best_day_idx = y_test_tree.idxmax()
    
    if isinstance(shap_values, np.ndarray):
        test_idx_position = X_test_tree.index.get_loc(best_day_idx)
        best_day_shap_values = shap_values[test_idx_position]
        
        print(f"Analyzing the highest HRV day: {test_tree.loc[best_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[best_day_idx]:.1f} ms)")
        
        shap.waterfall_plot(shap.Explanation(
            values=best_day_shap_values,
            base_values=explainer.expected_value,
            data=X_test_tree.loc[best_day_idx].values,
            feature_names=X_test_tree.columns.tolist()
        ))
    else:
        best_day_shap = shap_values[X_test_tree.index.get_loc(best_day_idx)]
        
        print(f"Analyzing the highest HRV day: {test_tree.loc[best_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[best_day_idx]:.1f} ms)")
        
        shap.plots.waterfall(best_day_shap, max_display=15, show=False)
        plt.tight_layout()
        plt.show()
else:
    print("⏭️ SHAP waterfall plots require SHAP values to be calculated first.")

Deconstructing a Low HRV Day

Analyzing the lowest HRV day: 2025-09-29 (HRV: 46.3 ms)

Deconstructing a High HRV Day

Analyzing the highest HRV day: 2025-10-02 (HRV: 100.9 ms)


Phase 4: Synthesis & Actionable Protocol

This final phase synthesizes the findings from both the Lasso and tree-based models and translates them into a practical, data-driven personal protocol to optimize recovery and performance.

4.1 Synthesis of Models: What Did We Learn?

By comparing our simple, interpretable Lasso model with the powerful, non-linear tree-based model, we can build a highly confident understanding of the key HRV drivers.

  • Strong Agreement: Both models overwhelmingly agree that resting_heart_rate_bpm and its trend rhr_deviation_from_7d_mean are the most dominant factors. This confirms that your baseline physiological state is the most critical input for next-day HRV.
  • Where Tree Models Shine: The non-linear tree model uncovered a crucial insight that Lasso missed: the high importance of cumulative_workout_minutes. It correctly identified this not as a source of fatigue, but as a proxy for fitness, where a higher chronic training load builds a more resilient system and a higher HRV ceiling.
  • The Power of Composite Features: Advanced features like the strain_recovery_balance_lag1 score and sleep_quality_score were consistently ranked as important by the tree model, showing that combining related metrics into a single, physiologically aware feature provides a powerful signal.

4.2 Actionable Insights & Personal Protocol

The model’s findings can be distilled into a simple, actionable protocol. This moves the project from descriptive analysis to a prescriptive tool for self-improvement.

1. The RHR Deviation Rule (The “Red Flag” Detector)
   • Finding: The feature rhr_deviation_from_7d_mean is a powerful negative driver, signaling acute stress.
   • Protocol: If my morning RHR is 2+ bpm above my 7-day average, I will treat it as a signal of systemic fatigue or impending illness. That day’s training will be shifted to active recovery (e.g., walking, mobility), regardless of the original plan. A minimal sketch of this check follows the list.

2. The Chronic Load Principle (The “Fitness Builder”)
   • Finding: cumulative_workout_minutes is a major positive driver of baseline HRV.
   • Protocol: The key to a higher HRV ceiling is not single heroic efforts, but consistency. The focus should be on maintaining a steady weekly training volume (total_workout_minutes_rolling7d_sum) to continuously build the fitness base.

3. The Physiological Bank Account (The “Daily Balance”)
   • Finding: The strain_recovery_balance_lag1 score was a clean and effective predictor in the tree model.
   • Protocol: After any day with a high-intensity workout (e.g., workout_intensity_exp in the top 25th percentile), I will make a conscious effort to maximize the “deposit” by prioritizing sleep duration and quality to offset the “withdrawal.”
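
To make rule 1 concrete, here is a minimal sketch of the red-flag check on the daily feature table, assuming the df used throughout with cycle_date and resting_heart_rate_bpm columns; the rolling baseline here is illustrative and may differ slightly from how the engineered rhr_deviation_from_7d_mean feature was computed:

# Sketch: flag mornings where RHR is 2+ bpm above the trailing 7-day average
rhr = df.sort_values('cycle_date')['resting_heart_rate_bpm']
trailing_7d_mean = rhr.rolling(7, min_periods=3).mean().shift(1)   # exclude today from its own baseline
red_flag = (rhr - trailing_7d_mean) >= 2

print("🚩 Active recovery day" if red_flag.iloc[-1] else "✅ Train as planned")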

By adhering to this data-driven protocol, the insights from this analysis can be directly applied to optimize daily well-being and long-term athletic adaptation.


Last updated: {python} from datetime import datetime; print(datetime.now().strftime("%Y-%m-%d %H:%M"))