---
title: "Understanding HRV: What Drives Heart Rate Variability?"
author: "Camilo Martinez"
date: today
format:
  html:
    code-fold: true
    code-tools: true
    toc: true
    toc-depth: 3
    theme: darkly
    embed-resources: true
execute:
  warning: false
  message: false
jupyter: python3
---
# Project Overview
## Research Question
**What aspects of my daily routine lead to high or low HRV (Heart Rate Variability)?**
This project uses **inference** rather than pure prediction to understand the underlying mechanisms that drive HRV variability. Our goal is to identify which lifestyle factors, sleep metrics, and workout characteristics have the strongest causal relationships with next-day HRV.
## Data Sources
All data comes from the **WHOOP API**, aggregated into a daily feature table with one row per day. Each row represents:
- **Target Variable**: `hrv_rmssd_ms` (HRV in milliseconds)
- **Recovery Metrics**: Resting heart rate, SpO2, skin temperature
- **Sleep Metrics**: Sleep stages, efficiency, disturbances, respiratory rate
- **Workout Metrics**: Total workout time, heart rate zones, average workout intensity
## Methodology
This analysis follows a rigorous data science workflow:
1. **Phase 1: Deep Exploratory Data Analysis (EDA)**
- Visualize target variable over time
- Analyze distributions and detect outliers
- Perform bivariate analysis
2. **Phase 2: Feature Engineering & Selection**
3. **Phase 3: Statistical Modeling & Inference**
4. **Phase 4: Validation & Interpretation**
## Executive Summary: Key Findings
This analysis successfully built a machine learning model capable of explaining a significant portion of daily HRV variance. The key takeaway is that **HRV is less about single-day actions and more about the body's underlying physiological state and accumulated load.**
The three most powerful drivers identified are:
1. **Resting Heart Rate (RHR):** A low RHR is the single strongest predictor of a high HRV, indicating a state of recovery and readiness.
2. **Acute Stress (RHR Deviation):** A sudden spike in RHR compared to its 7-day average is a powerful negative driver, signaling acute stress, illness, or poor recovery.
3. **Chronic Fitness Base:** Long-term cumulative workout minutes positively correlate with a higher baseline HRV, demonstrating that a consistent training load builds a more resilient system.
These insights form the basis for a personal, data-driven protocol to optimize daily performance and recovery.
---
# Setup & Data Loading
```{python}
#| label: setup
#| code-summary: "Import libraries and configure environment"
# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Database connection
import psycopg2
from sqlalchemy import create_engine
import os
from dotenv import load_dotenv
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Statistical analysis
from scipy import stats
from scipy.stats import pearsonr, spearmanr
# Configure plotting style
plt.style.use('dark_background')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (8, 4.5)
plt.rcParams['font.size'] = 9
# Load environment variables
load_dotenv()
print("✅ Libraries loaded successfully")
```
We proceed to load the data from a PostgreSQL database. The SQL query aggregates daily features from multiple tables, producing one row per day with the relevant metrics.
```{python}
#| label: load-data
#| code-summary: "Load HRV feature data from PostgreSQL"
# Database connection
DATABASE_URL = os.getenv('DATABASE_URL')
# Fix for SQLAlchemy: Replace 'postgres://' with 'postgresql://'
if DATABASE_URL and DATABASE_URL.startswith('postgres://'):
    DATABASE_URL = DATABASE_URL.replace('postgres://', 'postgresql://', 1)
# Create SQLAlchemy engine
engine = create_engine(DATABASE_URL)
# SQL query to fetch daily aggregated features
query = """
-- CTE 1: Pre-aggregates daily workout data
WITH workouts_daily_summary AS (
SELECT
DATE(start_time) AS workout_date,
SUM(EXTRACT(EPOCH FROM (end_time - start_time)) / 60) AS total_workout_minutes,
AVG(avg_heart_rate_bpm) AS avg_workout_heart_rate_bpm,
SUM(hr_zone_0_ms) AS total_hr_zone_0_ms,
SUM(hr_zone_1_ms) AS total_hr_zone_1_ms,
SUM(hr_zone_2_ms) AS total_hr_zone_2_ms,
SUM(hr_zone_3_ms) AS total_hr_zone_3_ms,
SUM(hr_zone_4_ms) AS total_hr_zone_4_ms,
SUM(hr_zone_5_ms) AS total_hr_zone_5_ms
FROM
whoop_workouts
WHERE
sport_name NOT IN ('sauna', 'meditation')
GROUP BY
1
),
-- CTE 2: Aggregates ALL cycles for a given day (sleep + naps)
cycles_daily_aggregates AS (
SELECT
DATE(start_time) AS cycle_date,
-- Duration-weighted average heart rate for the entire day
SUM(avg_heart_rate_bpm * EXTRACT(EPOCH FROM (end_time - start_time))) /
SUM(EXTRACT(EPOCH FROM (end_time - start_time))) AS cycle_avg_hr_bpm,
-- Maximum heart rate across all cycles for that day
MAX(max_heart_rate_bpm) AS cycle_max_hr_bpm
FROM
whoop_cycles
GROUP BY
1
),
-- CTE 3: Identifies the primary (longest) sleep cycle for each day
main_sleep_cycle AS (
SELECT
id,
start_time,
end_time,
-- Rank cycles by duration - longest sleep is rank 1
ROW_NUMBER() OVER(
PARTITION BY DATE(start_time)
ORDER BY (end_time - start_time) DESC
) as sleep_rank
FROM
whoop_cycles
)
-- Final query: One row per day with all features
SELECT
DATE(main.start_time) AS cycle_date,
-- TARGET VARIABLE
r.hrv_rmssd_ms,
-- Recovery Features
r.resting_heart_rate_bpm,
r.spo2_percentage,
r.skin_temp_celsius,
-- Cycle Features (accounts for naps)
c_agg.cycle_avg_hr_bpm,
c_agg.cycle_max_hr_bpm,
-- Sleep Features
s.respiratory_rate,
s.sleep_consistency_percentage,
s.sleep_efficiency_percentage,
s.total_in_bed_time_ms,
s.total_awake_time_ms,
s.total_light_sleep_time_ms,
s.total_slow_wave_sleep_time_ms,
s.total_rem_sleep_time_ms,
s.disturbance_count,
-- Workout Features (aggregated daily)
COALESCE(wds.total_workout_minutes, 0) AS total_workout_minutes,
COALESCE(wds.avg_workout_heart_rate_bpm, 0) AS avg_workout_hr_bpm,
COALESCE(wds.total_hr_zone_0_ms, 0) AS total_hr_zone_0_ms,
COALESCE(wds.total_hr_zone_1_ms, 0) AS total_hr_zone_1_ms,
COALESCE(wds.total_hr_zone_2_ms, 0) AS total_hr_zone_2_ms,
COALESCE(wds.total_hr_zone_3_ms, 0) AS total_hr_zone_3_ms,
COALESCE(wds.total_hr_zone_4_ms, 0) AS total_hr_zone_4_ms,
COALESCE(wds.total_hr_zone_5_ms, 0) AS total_hr_zone_5_ms
FROM
main_sleep_cycle AS main
LEFT JOIN
cycles_daily_aggregates AS c_agg ON DATE(main.start_time) = c_agg.cycle_date
LEFT JOIN
whoop_recovery AS r ON main.id = r.cycle_id
LEFT JOIN
whoop_sleep AS s ON main.id = s.cycle_id
LEFT JOIN
workouts_daily_summary AS wds ON DATE(main.start_time) = wds.workout_date
WHERE
main.sleep_rank = 1 -- One row per day
ORDER BY
cycle_date DESC;
"""
# Load data
df = pd.read_sql_query(query, engine)
# Ensure datetimelike dtype for downstream .dt usage
df['cycle_date'] = pd.to_datetime(df['cycle_date'], errors='coerce')
print(f"✅ Data loaded successfully")
print(f"📊 Dataset shape: {df.shape[0]} days × {df.shape[1]} features")
print(f"📅 Date range: {df['cycle_date'].min()} to {df['cycle_date'].max()}")
```
```{python}
#| label: data-overview
#| code-summary: "Display dataset structure and summary statistics"
# Display first few rows
print("\n" + "="*80)
print("DATASET PREVIEW")
print("="*80)
display(df.head(10))
# Display data types
print("\n" + "="*80)
print("DATA TYPES")
print("="*80)
print(df.dtypes)
# Basic statistics
print("\n" + "="*80)
print("SUMMARY STATISTICS")
print("="*80)
display(df.describe())
# Missing values
print("\n" + "="*80)
print("MISSING VALUES")
print("="*80)
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
'Missing Count': missing,
'Percentage': missing_pct
}).sort_values('Missing Count', ascending=False)
print(missing_df[missing_df['Missing Count'] > 0])
```
---
# Phase 1: Deep Exploratory Data Analysis
## 1.1 Target Variable Analysis: HRV Over Time
**Goal**: Understand the baseline behavior of HRV. Look for trends, seasonality, and anomalies.
```{python}
#| label: hrv-timeseries
#| code-summary: "Visualize HRV trends over time"
#| fig-cap: "HRV (RMSSD) time series showing daily variation and trends"
# Create interactive time series plot
fig = go.Figure()
# Add main HRV line
fig.add_trace(go.Scatter(
x=df['cycle_date'],
y=df['hrv_rmssd_ms'],
mode='lines+markers',
name='HRV (RMSSD)',
line=dict(color='#00D9FF', width=2),
marker=dict(size=4),
hovertemplate='<b>Date</b>: %{x}<br>' +
'<b>HRV</b>: %{y:.1f} ms<br>' +
'<extra></extra>'
))
# Add 7-day moving average
df_sorted = df.sort_values('cycle_date')
df_sorted['hrv_ma7'] = df_sorted['hrv_rmssd_ms'].rolling(window=7, center=True).mean()
fig.add_trace(go.Scatter(
x=df_sorted['cycle_date'],
y=df_sorted['hrv_ma7'],
mode='lines',
name='7-Day Moving Average',
line=dict(color='#FF6B6B', width=3, dash='dash'),
hovertemplate='<b>7-Day Avg</b>: %{y:.1f} ms<br>' +
'<extra></extra>'
))
# Add median reference line
median_hrv = df['hrv_rmssd_ms'].median()
fig.add_hline(
y=median_hrv,
line_dash="dot",
line_color="yellow",
annotation_text=f"Median: {median_hrv:.1f} ms",
annotation_position="right"
)
fig.update_layout(
title="HRV (RMSSD) Over Time: Daily Variation & Trends",
xaxis_title="Date",
yaxis_title="HRV (ms)",
template="plotly_dark",
hovermode='x unified',
height=500
)
fig.show()
# Print key statistics
print(f"\n📈 HRV STATISTICS")
print(f"{'='*50}")
print(f"Mean HRV: {df['hrv_rmssd_ms'].mean():.2f} ms")
print(f"Median HRV: {df['hrv_rmssd_ms'].median():.2f} ms")
print(f"Std Dev: {df['hrv_rmssd_ms'].std():.2f} ms")
print(f"Min HRV: {df['hrv_rmssd_ms'].min():.2f} ms (Date: {df.loc[df['hrv_rmssd_ms'].idxmin(), 'cycle_date']})")
print(f"Max HRV: {df['hrv_rmssd_ms'].max():.2f} ms (Date: {df.loc[df['hrv_rmssd_ms'].idxmax(), 'cycle_date']})")
print(f"Range: {df['hrv_rmssd_ms'].max() - df['hrv_rmssd_ms'].min():.2f} ms")
print(f"CV: {(df['hrv_rmssd_ms'].std() / df['hrv_rmssd_ms'].mean() * 100):.2f}%")
```
My HRV data shows steady resilience with an improving recovery trend:

- A strong, stable baseline around 66 ms indicates consistent recovery.
- The gradual rise from August to October signals improving fitness or recovery habits (better sleep, balanced training).
- The June dip likely reflects fatigue or stress, periods where recovery lagged behind strain.
- The October peak suggests optimal recovery and readiness.

The coefficient of variation (standard deviation / mean) is 16.19%, a healthy level of variability that indicates good autonomic nervous system function.
```{python}
#| label: hrv-distribution
#| code-summary: "Analyze HRV distribution"
#| fig-cap: "HRV distribution with normality test"
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Histogram with KDE
axes[0].hist(df['hrv_rmssd_ms'], bins=30, alpha=0.7, color='#00D9FF', edgecolor='white')
axes[0].axvline(df['hrv_rmssd_ms'].mean(), color='#FF6B6B', linestyle='--', linewidth=2, label=f'Mean: {df["hrv_rmssd_ms"].mean():.1f}')
axes[0].axvline(df['hrv_rmssd_ms'].median(), color='yellow', linestyle='--', linewidth=2, label=f'Median: {df["hrv_rmssd_ms"].median():.1f}')
axes[0].set_xlabel('HRV (ms)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('HRV Distribution', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Q-Q plot for normality check
stats.probplot(df['hrv_rmssd_ms'].dropna(), dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot: Testing Normality', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Normality test
statistic, p_value = stats.shapiro(df['hrv_rmssd_ms'].dropna())
print(f"\n🔬 SHAPIRO-WILK NORMALITY TEST")
print(f"{'='*50}")
print(f"Test Statistic: {statistic:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value > 0.05:
    print("✅ HRV appears normally distributed (p > 0.05)")
else:
    print("⚠️ HRV may not be normally distributed (p ≤ 0.05)")
```
- **Not normal**: Shapiro-Wilk p = 0.0348, so we reject normality at the 5% level.
- **Shape**: Nearly centered (mean 66.7 ≈ median 65.9) but with non-normal tails (the Q-Q plot deviates at both ends), suggesting mild right skew or heavy tails.
- **Implication**: Prefer robust statistics and tests: median/IQR, Spearman (ρ) over Pearson when relating features, non-parametric tests, or bootstrap CIs.
- **For modeling**: Tree models are fine. For linear models, consider transforming HRV or the predictors (e.g., Box-Cox/Yeo-Johnson) and check the residuals.
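To make the robust alternatives above concrete, the sketch below applies Spearman correlation and a Yeo-Johnson transform to synthetic right-skewed data. The data here is simulated (a gamma-distributed stand-in for HRV and a loosely related predictor), not the WHOOP table:

```{python}
#| label: robust-stats-sketch
#| code-summary: "Sketch: Spearman correlation and Yeo-Johnson on skewed synthetic data"
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Right-skewed stand-in for HRV and a loosely related predictor
hrv_sim = rng.gamma(shape=9.0, scale=7.5, size=150)
rhr_sim = 60 - 0.1 * hrv_sim + rng.normal(0, 2, size=150)

# Spearman's rho is rank-based: no normality assumption needed
rho, p = stats.spearmanr(rhr_sim, hrv_sim)
print(f"Spearman rho = {rho:.3f} (p = {p:.4g})")

# Yeo-Johnson pulls the distribution toward symmetry before linear modeling
pt = PowerTransformer(method='yeo-johnson')
hrv_t = pt.fit_transform(hrv_sim.reshape(-1, 1)).ravel()
print(f"Skew before: {stats.skew(hrv_sim):.2f}, after: {stats.skew(hrv_t):.2f}")
```

The same pattern would apply to the real `hrv_rmssd_ms` column: correlate with `spearmanr`, and fit the transformer on training data only.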
## 1.2 Feature Distributions & Outlier Detection
**Goal**: Identify data quality issues, outliers, and potential errors before modeling.
```{python}
#| label: feature-distributions
#| code-summary: "Analyze distributions of all features"
#| fig-cap: "Distribution plots for all numeric features"
# Select numeric columns (exclude date and target)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove('hrv_rmssd_ms') # We'll handle target separately
# Create subplots
n_cols = 4
n_rows = int(np.ceil(len(numeric_cols) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, n_rows * 3))
axes = axes.flatten()
for idx, col in enumerate(numeric_cols):
    ax = axes[idx]
    data = df[col].dropna()
    # Histogram
    ax.hist(data, bins=30, alpha=0.7, color='#00D9FF', edgecolor='white')
    ax.axvline(data.mean(), color='#FF6B6B', linestyle='--', linewidth=2, label=f'Mean: {data.mean():.1f}')
    ax.axvline(data.median(), color='yellow', linestyle='--', linewidth=2, label=f'Median: {data.median():.1f}')
    ax.set_title(col, fontsize=10, fontweight='bold')
    ax.set_xlabel('Value', fontsize=8)
    ax.set_ylabel('Frequency', fontsize=8)
    ax.legend(fontsize=7)
    ax.grid(alpha=0.3)
# Hide unused subplots
for idx in range(len(numeric_cols), len(axes)):
    axes[idx].axis('off')
plt.tight_layout()
plt.show()
```
```{python}
#| label: outlier-detection
#| code-summary: "Detect outliers using IQR method"
print("\n🔍 OUTLIER DETECTION (IQR Method)")
print("="*80)
outlier_summary = []
for col in numeric_cols:
    data = df[col].dropna()
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    if len(outliers) > 0:
        outlier_summary.append({
            'Feature': col,
            'Outlier Count': len(outliers),
            'Percentage': f"{len(outliers) / len(data) * 100:.2f}%",
            'Lower Bound': f"{lower_bound:.2f}",
            'Upper Bound': f"{upper_bound:.2f}",
            'Min Outlier': f"{outliers.min():.2f}",
            'Max Outlier': f"{outliers.max():.2f}"
        })
if outlier_summary:
    outlier_df = pd.DataFrame(outlier_summary)
    display(outlier_df)
else:
    print("✅ No significant outliers detected in any feature")
```
```{python}
#| label: boxplots
#| code-summary: "Boxplots for visual outlier inspection"
#| fig-cap: "Boxplots showing outliers across all features"
# Normalize data for visualization (z-score)
df_normalized = df[numeric_cols].apply(lambda x: (x - x.mean()) / x.std())
fig, ax = plt.subplots(figsize=(10, 5))
bp = ax.boxplot([df_normalized[col].dropna() for col in numeric_cols],
labels=numeric_cols,
patch_artist=True,
showfliers=True)
# Customize boxplot colors
for patch in bp['boxes']:
    patch.set_facecolor('#00D9FF')
    patch.set_alpha(0.7)
for whisker in bp['whiskers']:
    whisker.set(color='white', linewidth=1.5)
for cap in bp['caps']:
    cap.set(color='white', linewidth=1.5)
for median in bp['medians']:
    median.set(color='#FF6B6B', linewidth=2)
for flier in bp['fliers']:
    flier.set(marker='o', color='red', alpha=0.5)
ax.set_xlabel('Features (Normalized)', fontsize=12, fontweight='bold')
ax.set_ylabel('Z-Score', fontsize=12, fontweight='bold')
ax.set_title('Outlier Detection: Boxplots (Normalized)', fontsize=14, fontweight='bold')
ax.axhline(y=0, color='yellow', linestyle='--', alpha=0.5, label='Mean')
ax.grid(axis='y', alpha=0.3)
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()
```
## 1.3 Bivariate Analysis: HRV Relationships
**Goal**: Understand which features have the strongest relationships with HRV.
```{python}
#| label: correlation-heatmap
#| code-summary: "Correlation matrix with HRV"
#| fig-cap: "Correlation heatmap showing relationships between features and HRV"
# Calculate correlations with HRV
correlations = df[numeric_cols + ['hrv_rmssd_ms']].corr()['hrv_rmssd_ms'].sort_values(ascending=False)
# Create horizontal bar plot
fig, ax = plt.subplots(figsize=(8, 8))
colors = ['#FF6B6B' if x < 0 else '#00D9FF' for x in correlations.values[1:]] # Skip HRV itself
bars = ax.barh(correlations.index[1:], correlations.values[1:], color=colors, alpha=0.8)
ax.axvline(x=0, color='white', linestyle='-', linewidth=1)
ax.set_xlabel('Correlation with HRV', fontsize=12, fontweight='bold')
ax.set_title('Feature Correlations with HRV (RMSSD)', fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)
# Add value labels
for i, (bar, val) in enumerate(zip(bars, correlations.values[1:])):
    ax.text(val + 0.01 if val > 0 else val - 0.01, i, f'{val:.3f}',
            ha='left' if val > 0 else 'right', va='center', fontsize=9, fontweight='bold')
plt.tight_layout()
plt.show()
print("\n🔗 TOP 10 STRONGEST CORRELATIONS WITH HRV")
print("="*50)
# Convert to list to make it subscriptable
corr_items = list(correlations.items())
for feature, corr in corr_items[1:11]:  # Top 10, skip HRV itself
    print(f"{feature:40s}: {corr:>7.3f}")
```
```{python}
#| label: scatter-plots
#| code-summary: "Scatter plots for key features vs HRV"
#| fig-cap: "Bivariate relationships: Key features vs HRV"
# Select top 6 correlated features (absolute value)
top_features = correlations.abs().sort_values(ascending=False).index[1:7].tolist()
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
axes = axes.flatten()
for idx, feature in enumerate(top_features):
    ax = axes[idx]
    # Scatter plot
    ax.scatter(df[feature], df['hrv_rmssd_ms'], alpha=0.6, s=30, color='#00D9FF')
    # Add regression line (fit only on rows where both values are present)
    mask = df[[feature, 'hrv_rmssd_ms']].notna().all(axis=1)
    if mask.sum() > 1:
        z = np.polyfit(df.loc[mask, feature], df.loc[mask, 'hrv_rmssd_ms'], 1)
        p = np.poly1d(z)
        ax.plot(df.loc[mask, feature].sort_values(),
                p(df.loc[mask, feature].sort_values()),
                "r--", linewidth=2, alpha=0.8, label='Linear Fit')
    # Correlate on aligned pairs: calling .dropna() per column separately would misalign rows
    corr, p_val = pearsonr(df.loc[mask, feature], df.loc[mask, 'hrv_rmssd_ms'])
    ax.set_xlabel(feature, fontsize=10, fontweight='bold')
    ax.set_ylabel('HRV (ms)', fontsize=10, fontweight='bold')
    ax.set_title(f'{feature}\nCorr: {corr:.3f} (p={p_val:.4f})', fontsize=11, fontweight='bold')
    ax.grid(alpha=0.3)
    ax.legend()
plt.tight_layout()
plt.show()
```
```{python}
#| label: sleep-stages-analysis
#| code-summary: "Analyze relationship between sleep stages and HRV"
#| fig-cap: "Sleep stage impact on HRV"
# Convert sleep times from ms to hours
sleep_cols = ['total_light_sleep_time_ms', 'total_slow_wave_sleep_time_ms',
'total_rem_sleep_time_ms', 'total_awake_time_ms']
df_sleep = df.copy()
for col in sleep_cols:
    df_sleep[col.replace('_ms', '_hrs')] = df_sleep[col] / (1000 * 60 * 60)
# Create grouped bar plot with improved labels
sleep_stage_corrs = []
# Use cleaner, more readable labels
label_mapping = {
'total_light_sleep_time_ms': 'Light Sleep',
'total_slow_wave_sleep_time_ms': 'Deep Sleep (SWS)',
'total_rem_sleep_time_ms': 'REM Sleep',
'total_awake_time_ms': 'Awake Time'
}
for col in sleep_cols:
    col_hrs = col.replace('_ms', '_hrs')
    # Align rows before correlating: separate dropna() calls would shift the pairs
    pair = df_sleep[[col_hrs, 'hrv_rmssd_ms']].dropna()
    corr, _ = pearsonr(pair[col_hrs], pair['hrv_rmssd_ms'])
    sleep_stage_corrs.append({
        'Sleep Stage': label_mapping[col],
        'Correlation': corr
    })
sleep_corr_df = pd.DataFrame(sleep_stage_corrs)
# Create figure with more padding
fig, ax = plt.subplots(figsize=(10, 5.5))
bars = ax.bar(sleep_corr_df['Sleep Stage'], sleep_corr_df['Correlation'],
color=['#00D9FF', '#FF6B6B', '#FFD93D', '#6BCB77'],
alpha=0.8, width=0.6, edgecolor='white', linewidth=1.5)
ax.axhline(y=0, color='white', linestyle='-', linewidth=1.5, alpha=0.5)
ax.set_ylabel('Correlation with HRV', fontsize=14, fontweight='bold', labelpad=15)
ax.set_xlabel('Sleep Stage', fontsize=14, fontweight='bold', labelpad=15)
ax.set_title('Sleep Stage Correlations with HRV', fontsize=16, fontweight='bold', pad=20)
ax.grid(axis='y', alpha=0.3, linestyle='--')
# Improved value labels - positioned directly on top of bars
for bar, val in zip(bars, sleep_corr_df['Correlation']):
    height = bar.get_height()
    # Position the label just above the top of each bar
    ax.text(bar.get_x() + bar.get_width() / 2.,
            height,
            f'{val:.3f}',
            ha='center',
            va='bottom',  # bottom of the text sits on top of the bar
            fontsize=13,
            fontweight='bold',
            color='white',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='black', alpha=0.8, edgecolor='white', linewidth=1))
# Improved x-axis labels - no rotation needed with cleaner names
ax.tick_params(axis='both', which='major', labelsize=12, length=6, width=1.5)
plt.xticks(rotation=0) # Horizontal labels for better readability
# Add more padding around the plot
plt.tight_layout(pad=2.0) # Increased padding
plt.subplots_adjust(left=0.12, right=0.95, top=0.92, bottom=0.12)
plt.show()
```
- REM and light sleep matter most for next-morning HRV: r ≈ 0.27 each, or about 7% of HRV variance individually (0.27²).
- Slow-wave (deep) sleep shows only a weak link (r ≈ 0.09, under 1% of variance).
- Awake time is essentially uninformative (r ≈ 0.006).
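The variance-explained figures above are simply the squared correlations. A quick check, using the rounded r values read off the chart rather than recomputed from the data:

```{python}
#| label: sleep-r2-sketch
#| code-summary: "Sketch: converting correlation r into shared variance r²"
# Rounded correlations from the chart above (approximate values)
stage_r = {'REM': 0.27, 'Light': 0.27, 'Deep (SWS)': 0.09, 'Awake': 0.006}
for stage, r in stage_r.items():
    # r² is the fraction of HRV variance linearly shared with the sleep stage
    print(f"{stage:>11s}: r = {r:+.3f} -> r² = {r**2:.1%} of HRV variance")
```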
# Key Insights from EDA
```{python}
#| label: eda-summary
#| code-summary: "Generate summary of key findings"
print("\n" + "="*80)
print("📊 EXPLORATORY DATA ANALYSIS SUMMARY")
print("="*80)
print(f"\n1️⃣ DATASET OVERVIEW")
print(f" • Total observations: {len(df)} days")
print(f" • Date range: {(df['cycle_date'].max() - df['cycle_date'].min()).days} days")
print(f" • Missing data: {df.isnull().sum().sum()} total missing values")
print(f"\n2️⃣ HRV CHARACTERISTICS")
print(f" • Mean HRV: {df['hrv_rmssd_ms'].mean():.1f} ms")
print(f" • Coefficient of Variation: {(df['hrv_rmssd_ms'].std() / df['hrv_rmssd_ms'].mean() * 100):.1f}%")
print(f" • Distribution: {'Normal' if stats.shapiro(df['hrv_rmssd_ms'].dropna())[1] > 0.05 else 'Non-normal'}")
print(f"\n3️⃣ TOP POSITIVE CORRELATIONS (Higher → Higher HRV)")
top_pos = correlations[correlations > 0].sort_values(ascending=False)[1:4]
for feature, corr in top_pos.items():
    print(f" • {feature}: {corr:.3f}")
print(f"\n4️⃣ TOP NEGATIVE CORRELATIONS (Higher → Lower HRV)")
top_neg = correlations[correlations < 0].sort_values()[0:3]
for feature, corr in top_neg.items():
    print(f" • {feature}: {corr:.3f}")
print(f"\n5️⃣ DATA QUALITY")
outlier_count = sum([len(df[col][(df[col] < df[col].quantile(0.25) - 1.5 * (df[col].quantile(0.75) - df[col].quantile(0.25))) |
(df[col] > df[col].quantile(0.75) + 1.5 * (df[col].quantile(0.75) - df[col].quantile(0.25)))])
for col in numeric_cols])
print(f" • Total outliers detected: {outlier_count}")
print(f" • Features with >10% missing: {len(missing_df[missing_df['Percentage'] > 10])}")
print("\n" + "="*80)
print("✅ Phase 1 Complete: Ready for Feature Engineering & Modeling")
print("="*80)
```
---
# Phase 2: Data Preprocessing & Feature Engineering
## 2.1 Missing Data Imputation
**Critical First Step**: Handle missing values BEFORE creating any derived features. Lagged and rolling window calculations will fail or produce incorrect results with NULL values.
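A toy series illustrates the point (assumed values, not rows from the actual table): a single NULL propagates into the lagged column and poisons every rolling window that touches it.

```{python}
#| label: nan-lag-demo
#| code-summary: "Sketch: why NULLs must be imputed before lagging/rolling"
import numpy as np
import pandas as pd

s = pd.Series([50.0, np.nan, 60.0, 70.0])

lagged_raw = s.shift(1)                         # the NaN carries into the lag
rolled_raw = s.rolling(3).mean()                # any window containing the NaN is NaN
lagged_imputed = s.fillna(s.median()).shift(1)  # impute first, then lag

print(lagged_raw.tolist())      # [nan, 50.0, nan, 60.0]
print(rolled_raw.tolist())      # all NaN here: every 3-value window touches the gap
print(lagged_imputed.tolist())  # [nan, 50.0, 60.0, 60.0]
```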
```{python}
#| label: missing-data-imputation
#| code-summary: "Impute missing values with appropriate strategies"
# Store original dataset for comparison
df_original = df.copy()
print("🔧 MISSING DATA IMPUTATION")
print("="*80)
# Check missing data before imputation
missing_before = df.isnull().sum()
print("\n📊 Missing values BEFORE imputation:")
print(missing_before[missing_before > 0])
# Imputation Strategy:
# - skin_temp_celsius: Use MEDIAN (robust to outliers)
# - Other numeric features: Use MEDIAN (safer than mean for skewed distributions)
numeric_cols_to_impute = df.select_dtypes(include=[np.number]).columns.tolist()
for col in numeric_cols_to_impute:
    if df[col].isnull().sum() > 0:
        median_value = df[col].median()
        missing_n = df[col].isna().sum()
        df[col] = df[col].fillna(median_value)
        print(f"✅ {col}: Filled {missing_n} missing values with median ({median_value:.2f})")
# Verify no missing values remain
missing_after = df.isnull().sum()
print(f"\n✅ Total missing values AFTER imputation: {missing_after.sum()}")
if missing_after.sum() == 0:
    print("🎉 All missing data successfully imputed!")
else:
    print("⚠️ Some missing values remain:")
    print(missing_after[missing_after > 0])
```
## 2.2 Feature Engineering (Time-Based Features)
**Important**: Engineer features from the ORIGINAL, untransformed data. This keeps features in human-interpretable units (minutes, bpm, etc.).
```{python}
#| label: feature-engineering
#| code-summary: "Create lagged features and rolling windows for time-series analysis"
print("\n🔨 FEATURE ENGINEERING")
print("="*80)
# Store original dataset for comparison BEFORE feature engineering
df_original = df.copy()
# Sort by date to ensure correct time-series order
df = df.sort_values('cycle_date').reset_index(drop=True)
# Ensure cycle_date is true datetime for .dt operations (defensive)
from pandas.api.types import is_datetime64_any_dtype
if not is_datetime64_any_dtype(df['cycle_date']):
    df['cycle_date'] = pd.to_datetime(df['cycle_date'], errors='coerce')
# ============================================================================
# LAG FEATURES: Yesterday's values affecting today's HRV
# ============================================================================
lag_features = [
'total_workout_minutes',
'avg_workout_hr_bpm',
'resting_heart_rate_bpm',
'sleep_efficiency_percentage',
'total_rem_sleep_time_ms',
'total_slow_wave_sleep_time_ms'
]
print("\n📅 Creating LAG-1 Features (Previous Day):")
for feature in lag_features:
    lag_col_name = f"{feature}_lag1"
    df[lag_col_name] = df[feature].shift(1)
    print(f" ✅ {lag_col_name}")
# ============================================================================
# ROLLING WINDOW FEATURES: Multi-day trends and patterns
# ============================================================================
rolling_features = [
('total_workout_minutes', 7, 'mean'),
('total_workout_minutes', 7, 'sum'),
('resting_heart_rate_bpm', 7, 'mean'),
('sleep_efficiency_percentage', 7, 'mean'),
('hrv_rmssd_ms', 7, 'std'), # HRV variability over 7 days
]
print("\n📊 Creating ROLLING WINDOW Features:")
for feature, window, agg_func in rolling_features:
    col_name = f"{feature}_rolling{window}d_{agg_func}"
    if agg_func == 'mean':
        df[col_name] = df[feature].rolling(window=window, min_periods=1).mean()
    elif agg_func == 'sum':
        df[col_name] = df[feature].rolling(window=window, min_periods=1).sum()
    elif agg_func == 'std':
        df[col_name] = df[feature].rolling(window=window, min_periods=1).std()
    print(f" ✅ {col_name}")
# ============================================================================
# CUMULATIVE TRAINING LOAD: Running total of workout time
# ============================================================================
print("\n💪 Creating CUMULATIVE TRAINING LOAD:")
df['cumulative_workout_minutes'] = df['total_workout_minutes'].cumsum()
print(" ✅ cumulative_workout_minutes")
# ============================================================================
# INTERACTION FEATURES: Combined effects
# ============================================================================
print("\n🔗 Creating INTERACTION Features:")
# Sleep quality score (normalized composite)
df['sleep_quality_score'] = (
(df['sleep_efficiency_percentage'] / 100) * 0.4 +
(df['total_rem_sleep_time_ms'] / df['total_rem_sleep_time_ms'].max()) * 0.3 +
(df['total_slow_wave_sleep_time_ms'] / df['total_slow_wave_sleep_time_ms'].max()) * 0.3
)
print(" ✅ sleep_quality_score (composite)")
# Workout intensity index (HR zones weighted)
df['workout_intensity_index'] = (
df['total_hr_zone_0_ms'] * 0.5 +
df['total_hr_zone_1_ms'] * 1.0 +
df['total_hr_zone_2_ms'] * 1.5 +
df['total_hr_zone_3_ms'] * 2.0 +
df['total_hr_zone_4_ms'] * 2.5 +
df['total_hr_zone_5_ms'] * 3.0
) / 1000 # Convert ms to seconds
print(" ✅ workout_intensity_index (weighted HR zones)")
# ============================================================================
# ADVANCED PHYSIOLOGICAL FEATURES
# ============================================================================
print("\n🧠 Creating ADVANCED PHYSIOLOGICAL FEATURES:")
# 1. Exponentially Weighted Workout Intensity
print(" ➕ Exponentially Weighted Workout Intensity")
df['workout_intensity_exp'] = (
df['total_hr_zone_1_ms'] * (1.2**1) +
df['total_hr_zone_2_ms'] * (1.2**2) +
df['total_hr_zone_3_ms'] * (1.2**3) +
df['total_hr_zone_4_ms'] * (1.2**4) +
df['total_hr_zone_5_ms'] * (1.2**5)
)
# 2. Proportional Sleep Stages
print(" ➕ Proportional Sleep Stage Features")
df['total_sleep_time_ms'] = df['total_in_bed_time_ms'] - df['total_awake_time_ms']
df['rem_sleep_pct'] = df['total_rem_sleep_time_ms'] / df['total_sleep_time_ms']
df['sws_sleep_pct'] = df['total_slow_wave_sleep_time_ms'] / df['total_sleep_time_ms']
# 3. Rate of Change Features: Deviation from Rolling Mean
print(" ➕ Deviation from Rolling Mean Features")
if 'resting_heart_rate_bpm_rolling7d_mean' in df.columns:
    df['rhr_deviation_from_7d_mean'] = df['resting_heart_rate_bpm'] - df['resting_heart_rate_bpm_rolling7d_mean']
else:
    df['rhr_deviation_from_7d_mean'] = np.nan
# 4. Strain vs Recovery Balance Score
print(" ➕ Strain vs Recovery Balance Features")
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
try:
    df[['sleep_quality_score_scaled', 'workout_intensity_exp_scaled']] = scaler.fit_transform(
        df[['sleep_quality_score', 'workout_intensity_exp']]
    )
    df['strain_recovery_balance'] = df['sleep_quality_score_scaled'] - df['workout_intensity_exp_scaled']
    df['strain_recovery_balance_lag1'] = df['strain_recovery_balance'].shift(1)
except Exception as e:
    print(f"⚠️ Could not compute strain_recovery_balance: {e}")
# 5. Day-of-Week & Weekend Flag
print(" ➕ Day-of-Week and Weekend Flags")
df['day_of_week'] = df['cycle_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
print("✅ Advanced physiological features successfully added.")
# Display summary of new features
new_features = [col for col in df.columns if col not in df_original.columns]
print(f"\n✅ Total new features created: {len(new_features)}")
print(f"📊 Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns (was {df_original.shape[1]})")
```
```{python}
#| label: feature-summary
#| code-summary: "Display summary of engineered features"
print("\n📋 ENGINEERED FEATURES SUMMARY")
print("="*80)
feature_summary = []
# Lag features
lag_cols = [col for col in df.columns if '_lag1' in col]
feature_summary.append({'Category': 'Lag Features (Day -1)', 'Count': len(lag_cols), 'Examples': ', '.join(lag_cols[:3])})
# Rolling features
rolling_cols = [col for col in df.columns if 'rolling' in col]
feature_summary.append({'Category': 'Rolling Window (7-day)', 'Count': len(rolling_cols), 'Examples': ', '.join(rolling_cols[:3])})
# Cumulative features
cumulative_cols = [col for col in df.columns if 'cumulative' in col]
feature_summary.append({'Category': 'Cumulative Features', 'Count': len(cumulative_cols), 'Examples': ', '.join(cumulative_cols)})
# Interaction features
interaction_cols = ['sleep_quality_score', 'workout_intensity_index']
feature_summary.append({'Category': 'Interaction Features', 'Count': len(interaction_cols), 'Examples': ', '.join(interaction_cols)})
feature_summary_df = pd.DataFrame(feature_summary)
display(feature_summary_df)
```
## 2.3 Train-Test Split (Time-Series Aware)
**Critical**: Split data BEFORE any transformations that learn from the data (scaling, normalization). Use chronological split to prevent data leakage.
```{python}
#| label: train-test-split
#| code-summary: "Chronological split for time-series validation"
print("\n✂️ TRAIN-TEST SPLIT (Chronological)")
print("="*80)
# Remove rows with NaN values created by lagging (first row will have NaN lags)
# Also ensure cycle_date is datetime for .dt operations
df['cycle_date'] = pd.to_datetime(df['cycle_date'], errors='coerce')
df_clean = df.dropna()
# Calculate 80/20 split point
split_index = int(len(df_clean) * 0.8)
# Ensure we have a Timestamp for plotting and a date for filtering
split_date_ts = pd.to_datetime(df_clean.iloc[split_index]['cycle_date'])
split_date_date = split_date_ts.date()
# Ensure cycle_date is datetime and compare on .dt.date to avoid dtype issues
train_df = df_clean[df_clean['cycle_date'].dt.date < split_date_date].copy()
test_df = df_clean[df_clean['cycle_date'].dt.date >= split_date_date].copy()
print(f"📊 DATASET SPLIT:")
print(f" Training Set: {len(train_df)} days ({len(train_df)/len(df_clean)*100:.1f}%)")
print(f" Test Set: {len(test_df)} days ({len(test_df)/len(df_clean)*100:.1f}%)")
print(f" Split Date: {split_date_ts}")
print(f" Train Range: {train_df['cycle_date'].min()} → {train_df['cycle_date'].max()}")
print(f" Test Range: {test_df['cycle_date'].min()} → {test_df['cycle_date'].max()}")
# Verify no data leakage
if train_df['cycle_date'].max() >= test_df['cycle_date'].min():
print("⚠️ WARNING: Potential data leakage detected!")
else:
print("✅ No data leakage: Training data ends before test data begins")
# Visualize split
fig = go.Figure()
fig.add_trace(go.Scatter(
x=train_df['cycle_date'],
y=train_df['hrv_rmssd_ms'],
mode='markers',
name='Training Data',
marker=dict(color='#00D9FF', size=6, opacity=0.7),
))
fig.add_trace(go.Scatter(
x=test_df['cycle_date'],
y=test_df['hrv_rmssd_ms'],
mode='markers',
name='Test Data',
marker=dict(color='#FF6B6B', size=6, opacity=0.7),
))
fig.add_vline(
x=split_date_ts,
line_dash="dash",
line_color="yellow",
)
fig.add_annotation(
x=split_date_ts, y=1, yref="paper",
text="Train/Test Split",
showarrow=False, yanchor="bottom"
)
fig.update_layout(
title="Train-Test Split: Chronological Separation",
xaxis_title="Date",
yaxis_title="HRV (ms)",
template="plotly_dark",
height=500
)
fig.show()
```
### 🧩 Interpretation: Train–Test Split
- **Purpose**: The chart shows a chronological split of HRV data into training (blue) and test (red).
- **What It Means**: The **yellow dashed line** is the cutoff; left = train, right = test. This prevents temporal leakage and mirrors real forecasting.
- **Observations**: Test-period HRV is slightly higher/less variable (recent upward trend), but ranges overlap (~40–100 ms), so distribution shift is modest.
- **Why It Matters**: Validates that evaluation is truly out-of-sample in time.
```{python}
#| label: verify-preprocessing
#| code-summary: "Final verification of preprocessing pipeline"
print("\n✅ PREPROCESSING PIPELINE COMPLETE")
print("="*80)
print(f"\n1️⃣ Missing Data:")
print(f" • Original missing values: {df_original.isnull().sum().sum()}")
print(f" • After imputation: {df.isnull().sum().sum()}")
print(f"\n2️⃣ Feature Engineering:")
print(f" • Original features: {df_original.shape[1]}")
print(f" • Engineered features: {df.shape[1] - df_original.shape[1]}")
print(f" • Total features: {df.shape[1]}")
print(f"\n3️⃣ Data Split:")
print(f" • Training samples: {len(train_df)}")
print(f" • Test samples: {len(test_df)}")
print(f" • No data leakage: ✅")
print(f"\n📈 Ready for modeling with {df.shape[1]} features!")
```
## 2.4 Data Transformation & Scaling
**Strategy**: Create TWO separate datasets optimized for different model families:
1. **Tree-based models (XGBoost, Random Forest)**: Use raw data (no transformations needed)
2. **Linear models (Lasso, Ridge)**: Apply log transformations to highly skewed features
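As a quick illustration of why the log transform helps linear models, consider a toy right-skewed series (made-up minutes, not the real workout data):

```{python}
#| label: log1p-sketch
#| code-summary: "Toy demo: log(x + 1) reduces right-skew"
import numpy as np
import pandas as pd

# A few small workout days plus two huge ones → heavy right tail
minutes = pd.Series([5, 10, 15, 20, 20, 30, 45, 60, 240, 300], dtype=float)
print(f"Skewness before: {minutes.skew():.2f}")                 # strongly right-skewed
print(f"Skewness after log1p: {np.log1p(minutes).skew():.2f}")  # much closer to symmetric
```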
```{python}
#| label: transformation-strategy
#| code-summary: "Prepare datasets for tree-based and linear models"
print("\n🔀 DATA TRANSFORMATION STRATEGY")
print("="*80)
# ============================================================================
# DATASET 1: For Tree-Based Models (XGBoost, Random Forest)
# ============================================================================
print("\n🌳 DATASET 1: Tree-Based Models (No Transformation)")
print("-" * 80)
# Tree models are robust to skewed distributions - use raw features
train_tree = train_df.copy()
test_tree = test_df.copy()
print(f"✅ Training set: {train_tree.shape}")
print(f"✅ Test set: {test_tree.shape}")
print("📌 Strategy: Use raw features (tree models handle skewness naturally)")
# ============================================================================
# DATASET 2: For Linear Models (Lasso, Ridge, ElasticNet)
# ============================================================================
print("\n📈 DATASET 2: Linear Models (Log Transformation)")
print("-" * 80)
# Identify highly skewed features for transformation
skewed_features = [
'total_workout_minutes',
'total_hr_zone_0_ms',
'total_hr_zone_1_ms',
'total_hr_zone_2_ms',
'total_hr_zone_3_ms',
'total_hr_zone_4_ms',
'total_hr_zone_5_ms',
'workout_intensity_index',
'total_workout_minutes_rolling7d_sum',
'cumulative_workout_minutes'
]
# Create copies for linear models
train_linear = train_df.copy()
test_linear = test_df.copy()
print("\n🔢 Applying log(x + 1) transformation to skewed features:")
for feature in skewed_features:
if feature in train_linear.columns:
# Apply log(x + 1) transformation
train_linear[f'{feature}_log'] = np.log1p(train_linear[feature])
test_linear[f'{feature}_log'] = np.log1p(test_linear[feature])
# Drop original feature (keep only transformed version)
train_linear = train_linear.drop(columns=[feature])
test_linear = test_linear.drop(columns=[feature])
print(f" ✅ {feature} → {feature}_log")
print(f"\n✅ Training set: {train_linear.shape}")
print(f"✅ Test set: {test_linear.shape}")
print("📌 Strategy: Log-transform skewed workout features for linear assumptions")
# ============================================================================
# Visualize Transformation Effect
# ============================================================================
print("\n📊 Visualizing transformation effect on key features...")
```
```{python}
#| label: transformation-visualization
#| code-summary: "Compare distributions before/after log transformation"
#| fig-cap: "Effect of log transformation on skewed features"
# Select 4 key features to visualize
features_to_visualize = [
'total_workout_minutes',
'total_hr_zone_3_ms',
'workout_intensity_index',
'cumulative_workout_minutes'
]
fig, axes = plt.subplots(4, 2, figsize=(10, 11))
for idx, feature in enumerate(features_to_visualize):
if feature not in train_df.columns:
continue
# Left column: Original distribution
ax_orig = axes[idx, 0]
data_orig = train_df[feature].dropna()
ax_orig.hist(data_orig, bins=40, alpha=0.7, color='#FF6B6B', edgecolor='white')
ax_orig.set_title(f'Original: {feature}', fontsize=11, fontweight='bold')
ax_orig.set_xlabel('Value', fontsize=9)
ax_orig.set_ylabel('Frequency', fontsize=9)
ax_orig.grid(alpha=0.3)
# Add skewness annotation
skewness = data_orig.skew()
ax_orig.text(0.95, 0.95, f'Skewness: {skewness:.2f}',
transform=ax_orig.transAxes, ha='right', va='top',
bbox=dict(boxstyle='round', facecolor='black', alpha=0.5),
fontsize=9, color='white')
# Right column: Log-transformed distribution
ax_log = axes[idx, 1]
data_log = np.log1p(data_orig)
ax_log.hist(data_log, bins=40, alpha=0.7, color='#00D9FF', edgecolor='white')
ax_log.set_title(f'Log-Transformed: log({feature} + 1)', fontsize=11, fontweight='bold')
ax_log.set_xlabel('Log Value', fontsize=9)
ax_log.set_ylabel('Frequency', fontsize=9)
ax_log.grid(alpha=0.3)
# Add skewness annotation
skewness_log = data_log.skew()
ax_log.text(0.95, 0.95, f'Skewness: {skewness_log:.2f}',
transform=ax_log.transAxes, ha='right', va='top',
bbox=dict(boxstyle='round', facecolor='black', alpha=0.5),
fontsize=9, color='white')
plt.suptitle('Impact of Log Transformation on Skewed Features',
fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()
print("\n💡 TRANSFORMATION INSIGHTS:")
print(" • Original features show heavy right-skew (long tail of high values)")
print(" • Log transformation normalizes distributions for linear model assumptions")
print(" • Tree models don't need this - they handle skewness naturally")
```
```{python}
#| label: feature-scaling
#| code-summary: "Standardize features for linear models"
from sklearn.preprocessing import StandardScaler
print("\n📏 FEATURE SCALING (Linear Models Only)")
print("="*80)
# Separate features and target
feature_cols_linear = [col for col in train_linear.columns if col not in ['cycle_date', 'hrv_rmssd_ms']]
X_train_linear = train_linear[feature_cols_linear]
y_train_linear = train_linear['hrv_rmssd_ms']
X_test_linear = test_linear[feature_cols_linear]
y_test_linear = test_linear['hrv_rmssd_ms']
# Fit scaler on training data ONLY (prevent data leakage)
scaler = StandardScaler()
X_train_linear_scaled = scaler.fit_transform(X_train_linear)
X_test_linear_scaled = scaler.transform(X_test_linear)
# Convert back to DataFrame for easier handling
X_train_linear_scaled = pd.DataFrame(
X_train_linear_scaled,
columns=feature_cols_linear,
index=X_train_linear.index
)
X_test_linear_scaled = pd.DataFrame(
X_test_linear_scaled,
columns=feature_cols_linear,
index=X_test_linear.index
)
print(f"✅ Scaled {len(feature_cols_linear)} features using StandardScaler")
print(f" • Mean: {X_train_linear_scaled.mean().mean():.6f} (target: 0)")
print(f" • Std: {X_train_linear_scaled.std().mean():.6f} (target: 1)")
# For tree models, prepare feature matrices (no scaling needed)
feature_cols_tree = [col for col in train_tree.columns if col not in ['cycle_date', 'hrv_rmssd_ms']]
X_train_tree = train_tree[feature_cols_tree]
y_train_tree = train_tree['hrv_rmssd_ms']
X_test_tree = test_tree[feature_cols_tree]
y_test_tree = test_tree['hrv_rmssd_ms']
print(f"\n🌳 Tree model features prepared (no scaling): {len(feature_cols_tree)} features")
```
```{python}
#| label: final-dataset-summary
#| code-summary: "Summary of prepared datasets"
print("\n📦 FINAL DATASET SUMMARY")
print("="*80)
print("\n🌳 TREE-BASED MODELS (XGBoost, Random Forest)")
print("-" * 80)
print(f" Training features: {X_train_tree.shape}")
print(f" Test features: {X_test_tree.shape}")
print(f" Training target: {y_train_tree.shape}")
print(f" Test target: {y_test_tree.shape}")
print(f" Transformations: None (raw features)")
print(f" Scaling: None (tree models don't need it)")
print("\n📈 LINEAR MODELS (Lasso, Ridge, ElasticNet)")
print("-" * 80)
print(f" Training features: {X_train_linear_scaled.shape}")
print(f" Test features: {X_test_linear_scaled.shape}")
print(f" Training target: {y_train_linear.shape}")
print(f" Test target: {y_test_linear.shape}")
print(f" Transformations: log(x+1) on {len(skewed_features)} skewed features")
print(f" Scaling: StandardScaler (mean=0, std=1)")
print("\n✅ Both datasets ready for modeling!")
print("="*80)
```
```{python}
#| label: key-decisions-summary
#| code-summary: "Document key preprocessing decisions"
print("\n📋 KEY PREPROCESSING DECISIONS")
print("="*80)
print("\n1️⃣ Missing Data Handling")
print(" • Strategy: Median imputation (5 missing values in skin_temp_celsius)")
print(" • Rationale: Median is robust to outliers, preserves distribution shape")
print("\n2️⃣ Outlier Treatment")
print(" ❌ NO REMOVAL: Workout outliers are legitimate high-intensity training days")
print(" • Rationale: These are SIGNAL, not noise - critical for understanding HRV drivers")
print(" • Spike days represent key training events we want to learn from")
print("\n3️⃣ Feature Engineering")
print(" • Lag features (n=6): Capture previous-day effects")
print(" • Rolling windows (n=5): Smooth spiky workout data into cumulative load")
print(" • Interaction terms (n=2): Sleep quality score, workout intensity index")
print(" • Total new features: {0}".format(len([col for col in df.columns if col not in df_original.columns])))
print("\n4️⃣ Data Split")
print(" • Method: Chronological 80/20 split")
print(" • Rationale: Prevents temporal data leakage in time-series data")
print(" • Split date: {0}".format(split_date_ts))
print("\n5️⃣ Transformation Strategy")
print(" • Tree models: Raw features (naturally handle skewness)")
print(" • Linear models: Log-transform skewed workout features")
print(" • Rationale: Match transformation to model assumptions")
print("\n6️⃣ Scaling Strategy")
print(" • Tree models: No scaling (tree splits are scale-invariant)")
print(" • Linear models: StandardScaler (equal feature importance)")
print(" • Fit on training set ONLY (prevent data leakage)")
print("\n" + "="*80)
print("🎯 Ready to proceed to Phase 3: Statistical Modeling!")
print("="*80)
```
---
# Phase 3: Statistical Modeling & Inference
**Goal**: Train predictive models to identify the most significant drivers of HRV. We will use two distinct approaches:
1. **Lasso Regression**: A linear model that is excellent for inference due to its interpretable coefficients and built-in feature selection.
2. **XGBoost**: A powerful gradient boosting model that can capture complex, non-linear relationships, which we will interpret using SHAP values.
```{python}
#| label: modeling-setup
#| code-summary: "Import modeling libraries and define evaluation metrics"
# Core modeling libraries
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
import shap
# Helper for displaying results
from IPython.display import display, Markdown
print("✅ Modeling libraries loaded successfully")
```
## 3.1 Model Training
We will now train both models on their respective prepared datasets.
* **Lasso** will use the log-transformed and scaled data (`_linear_scaled`).
* **XGBoost** will use the raw, engineered data (`_tree`).
```{python}
#| label: model-training
#| code-summary: "Train Lasso and XGBoost models"
# --- 1. Lasso Regression ---
print("🏃 Training Lasso model...")
# Alpha is a hyperparameter that controls regularization. Higher alpha = simpler model.
# 0.1 is a reasonable starting point for standardized data.
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X_train_linear_scaled, y_train_linear)
print("✅ Lasso model trained.")
# --- 2. XGBoost Regressor ---
print("\n🏃 Training XGBoost model...")
# Using early stopping to prevent overfitting
xgb_model = xgb.XGBRegressor(
n_estimators=1000,
learning_rate=0.05,
early_stopping_rounds=50,
random_state=42,
objective='reg:squarederror'
)
# We use the test set as the evaluation set for early stopping. Note this leaks a little
# test information into model selection; a stricter setup would hold out a validation
# slice from the end of the training window instead.
xgb_model.fit(
X_train_tree, y_train_tree,
eval_set=[(X_test_tree, y_test_tree)],
verbose=False
)
print(f"✅ XGBoost model trained. Best iteration: {xgb_model.best_iteration}")
```
## 3.2 Model Evaluation
Let's evaluate how well each model performs on both the training data and the unseen test data. The key comparison is the **Test RMSE** (Root Mean Squared Error), which gives the typical prediction error in milliseconds of HRV.
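As a reminder of what these two metrics mean, a toy calculation on made-up HRV values:

```{python}
#| label: metrics-sketch
#| code-summary: "Toy demo: RMSE and R² on made-up numbers"
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([60.0, 75.0, 55.0, 80.0, 65.0])  # hypothetical actual HRV (ms)
y_pred = np.array([62.0, 70.0, 58.0, 78.0, 66.0])  # hypothetical predictions

rmse_toy = np.sqrt(mean_squared_error(y_true, y_pred))
r2_toy = r2_score(y_true, y_pred)
print(f"RMSE: {rmse_toy:.2f} ms")  # → RMSE: 2.93 ms (typical error)
print(f"R²:   {r2_toy:.0%}")       # → R²: 90% (share of variance explained)
```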
```{python}
#| label: model-evaluation
#| code-summary: "Evaluate model performance with RMSE and R-squared"
models = {
"Lasso": {"model": lasso, "X_train": X_train_linear_scaled, "y_train": y_train_linear, "X_test": X_test_linear_scaled, "y_test": y_test_linear},
}
if 'best_tree_model' in globals():
models["TunedTree"] = {"model": best_tree_model, "X_train": X_train_tree, "y_train": y_train_tree, "X_test": X_test_tree, "y_test": y_test_tree}
elif 'xgb_model' in globals():
models["XGBoost"] = {"model": xgb_model, "X_train": X_train_tree, "y_train": y_train_tree, "X_test": X_test_tree, "y_test": y_test_tree}
```
```{python}
#| label: hyperparameter-search
#| code-summary: "Time-series aware hyperparameter search (Lasso & XGBoost/HistGB)"
import math
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LassoCV
from sklearn.ensemble import HistGradientBoostingRegressor

def rmse(y, yhat):
    return float(np.sqrt(mean_squared_error(y, yhat)))
print("\n🧪 HYPERPARAMETER SEARCH (time-series aware)")
print("="*80)
# -------------------------------
# LASSO: Alpha search via LassoCV
# -------------------------------
print("\n🔧 Tuning Lasso (alpha)")
lasso_alphas = np.logspace(-3, 1, 30) # 0.001 .. 10
lasso_cv = LassoCV(alphas=lasso_alphas, cv=TimeSeriesSplit(n_splits=5), random_state=42, max_iter=20000)
lasso_cv.fit(X_train_linear_scaled, y_train_linear)
lasso_best_alpha = float(lasso_cv.alpha_)
print(f" ✅ Best alpha: {lasso_best_alpha:.5f}")
# Refit Lasso with best alpha on all training data
lasso = Lasso(alpha=lasso_best_alpha, random_state=42)
lasso.fit(X_train_linear_scaled, y_train_linear)
# Evaluate
lasso_pred_val = lasso.predict(X_test_linear_scaled)
print(f" 📏 Lasso Test RMSE: {rmse(y_test_linear, lasso_pred_val):.2f} ms")
# --------------------------------------------------------------
# XGBOOST / HISTGB: Manual randomized search with TimeSeriesSplit
# --------------------------------------------------------------
N_TRIALS = 40 # increase for deeper search
N_SPLITS = 5
tscv = TimeSeriesSplit(n_splits=N_SPLITS)
best_params = None
best_score = math.inf
best_model = None
model_name = None
# Parameter spaces
xgb_space = {
'n_estimators': lambda: np.random.randint(200, 1200),
'max_depth': lambda: np.random.randint(2, 8),
'learning_rate': lambda: 10 ** np.random.uniform(-3.0, -0.7), # 0.001..0.2
'subsample': lambda: np.random.uniform(0.6, 1.0),
'colsample_bytree': lambda: np.random.uniform(0.6, 1.0),
'min_child_weight': lambda: np.random.uniform(1.0, 8.0),
'reg_lambda': lambda: 10 ** np.random.uniform(-3, 2),
'reg_alpha': lambda: 10 ** np.random.uniform(-3, 1),
}
hgb_space = {
'max_depth': lambda: np.random.randint(2, 8),
'learning_rate': lambda: 10 ** np.random.uniform(-3.0, -0.7),
'l2_regularization': lambda: 10 ** np.random.uniform(-6, 0),
'min_samples_leaf': lambda: np.random.randint(10, 60),
'max_leaf_nodes': lambda: np.random.randint(15, 80),
}
use_xgb = 'xgb' in globals()  # fall back to HistGradientBoosting if xgboost was not imported
print(f"\n🔧 Tuning {'XGBoost' if use_xgb else 'HistGradientBoosting'} with {N_TRIALS} trials × {N_SPLITS}-fold TimeSeriesSplit")
for trial in range(N_TRIALS):
if use_xgb:
        # Sample params; early_stopping_rounds belongs in the constructor (XGBoost ≥ 1.6 sklearn API)
        params = {k: sampler() for k, sampler in xgb_space.items()}
        model = xgb.XGBRegressor(
            objective='reg:squarederror',
            random_state=42,
            early_stopping_rounds=50,
            **params
        )
else:
params = {k: sampler() for k, sampler in hgb_space.items()}
model = HistGradientBoostingRegressor(random_state=42, **params)
# Cross-validated RMSE
fold_rmses = []
for train_idx, val_idx in tscv.split(X_train_tree):
X_tr, X_val = X_train_tree.iloc[train_idx], X_train_tree.iloc[val_idx]
y_tr, y_val = y_train_tree.iloc[train_idx], y_train_tree.iloc[val_idx]
        if use_xgb:
            model.fit(
                X_tr, y_tr,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
else:
model.fit(X_tr, y_tr)
preds = model.predict(X_val)
fold_rmses.append(rmse(y_val, preds))
cv_rmse = float(np.mean(fold_rmses))
    if cv_rmse < best_score:
        best_score = cv_rmse
        best_params = params
        model_name = 'XGBoost' if use_xgb else 'HistGB'
if (trial+1) % 5 == 0:
print(f" Trial {trial+1:02d}/{N_TRIALS}: CV RMSE = {cv_rmse:.2f} ms | Best = {best_score:.2f} ms")
print(f"\n🏆 Best {model_name} CV RMSE: {best_score:.2f} ms")
print(f"🔎 Best params: {best_params}")
# Refit best model on full training set and evaluate on held-out test set
if use_xgb:
    best_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42, early_stopping_rounds=50, **best_params)
    best_model.fit(X_train_tree, y_train_tree, eval_set=[(X_test_tree, y_test_tree)], verbose=False)
else:
best_model = HistGradientBoostingRegressor(random_state=42, **best_params)
best_model.fit(X_train_tree, y_train_tree)
pred_test = best_model.predict(X_test_tree)
print(f"\n🧪 Final Test RMSE ({model_name} tuned): {rmse(y_test_tree, pred_test):.2f} ms")
# Save tuned model for downstream sections and register it in the comparison table
best_tree_model = best_model
models["TunedTree"] = {"model": best_tree_model, "X_train": X_train_tree, "y_train": y_train_tree, "X_test": X_test_tree, "y_test": y_test_tree}
results = []
for name, m in models.items():
# Predictions
pred_train = m['model'].predict(m['X_train'])
pred_test = m['model'].predict(m['X_test'])
# Metrics
rmse_train = np.sqrt(mean_squared_error(m['y_train'], pred_train))
r2_train = r2_score(m['y_train'], pred_train)
rmse_test = np.sqrt(mean_squared_error(m['y_test'], pred_test))
r2_test = r2_score(m['y_test'], pred_test)
results.append({
"Model": name,
"Train RMSE (ms)": f"{rmse_train:.2f}",
"Test RMSE (ms)": f"{rmse_test:.2f}",
"Train R²": f"{r2_train:.2%}",
"Test R²": f"{r2_test:.2%}"
})
results_df = pd.DataFrame(results)
display(Markdown("### Model Performance Comparison"))
display(results_df)
print("\n💡 INTERPRETATION:")
print("   • Test RMSE is the most important metric: the typical prediction error on unseen data.")
print("   • R² is the share of HRV variance the model explains.")
print("   • The tuned tree model appears more accurate, but our main goal is inference, which we explore next.")
```
### 🧩 Interpretation: Model Performance
- **RMSE**: The **Test RMSE of ~7.6 ms** for the tuned tree model is our key metric. It means, on average, our predictions for next-day HRV are off by about 7.6 milliseconds. Given the natural daily HRV fluctuation (Std Dev ≈ 10.8 ms), this is a strong result.
- **R²**: The model explains roughly **50% of the variance** in next-day HRV on unseen data. This is excellent for complex physiological data, indicating our features capture a substantial amount of the signal.
To understand where the model succeeds and fails, we can visualize its predictions against the actual values over time.
```{python}
#| label: actuals-vs-predictions
#| code-summary: "Plot actual HRV vs. model predictions on the test set"
# Add predictions to the test dataframe for easy plotting
test_results_df = test_tree.copy()
test_results_df['cycle_date'] = test_df['cycle_date'].values
test_results_df['hrv_actual'] = y_test_tree.values
test_results_df['hrv_predicted'] = best_tree_model.predict(X_test_tree)
fig = go.Figure()
fig.add_trace(go.Scatter(
x=test_results_df['cycle_date'],
y=test_results_df['hrv_actual'],
mode='lines+markers',
name='Actual HRV',
line=dict(color='#00D9FF')
))
fig.add_trace(go.Scatter(
x=test_results_df['cycle_date'],
y=test_results_df['hrv_predicted'],
mode='lines',
name='Predicted HRV',
line=dict(color='#FF6B6B', dash='dash')
))
fig.update_layout(
title='Actual vs. Predicted HRV on Test Set',
template='plotly_dark',
height=500,
legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
)
fig.show()
```
### 🧩 Interpretation: Actual vs. Predicted
- The model successfully captures the **general trend and magnitude** of HRV fluctuations.
- It correctly predicts the direction of most major peaks and troughs, though it tends to underestimate the most extreme values: a regression-to-the-mean effect common to models trained with squared-error loss.
## 3.3 Inference: What Drives Your HRV?
This is the core of our project. We will now interpret what each model has learned.
### 3.3.1 Lasso: Linear Drivers
The coefficients of the Lasso model tell us the linear relationship between each feature and HRV. Because Lasso pushes unimportant features to zero, we are left with a sparse, interpretable set of drivers.
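To see this sparsity property in action, here is a small simulated sketch (the data is synthetic; only the first two of five features actually drive the target):

```{python}
#| label: lasso-sparsity-sketch
#| code-summary: "Toy demo: Lasso zeroes out irrelevant features"
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))  # 5 standardized features
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
print(np.round(demo.coef_, 2))  # the three irrelevant features land at (near) zero
```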
```{python}
#| label: lasso-inference
#| code-summary: "Interpret Lasso model coefficients to find key linear drivers"
# Extract coefficients
lasso_coeffs = pd.DataFrame({
'Feature': X_train_linear_scaled.columns,
'Coefficient': lasso.coef_
})
# Filter out zero-coefficient features
lasso_coeffs = lasso_coeffs[lasso_coeffs['Coefficient'] != 0].sort_values('Coefficient', ascending=False)
# Plot coefficients
fig, ax = plt.subplots(figsize=(8, 8))
colors = ['#00D9FF' if x > 0 else '#FF6B6B' for x in lasso_coeffs['Coefficient']]
bars = ax.barh(lasso_coeffs['Feature'], lasso_coeffs['Coefficient'], color=colors)
ax.set_title('Lasso Coefficients: Key Drivers of HRV', fontsize=16, fontweight='bold')
ax.set_xlabel('Coefficient (Impact on HRV)', fontsize=12)
ax.axvline(0, color='white', linewidth=0.5)
ax.grid(axis='x', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
display(Markdown("### Top 5 Positive Drivers (Lasso)"))
display(lasso_coeffs.head(5))
display(Markdown("### Top 5 Negative Drivers (Lasso)"))
display(lasso_coeffs.tail(5))
print("\n💡 INTERPRETATION:")
print(f" • Positive coefficients increase HRV (good for recovery).")
print(f" • Negative coefficients decrease HRV (bad for recovery).")
print(f" • These show the impact of a one-standard-deviation change in the feature on HRV (in ms).")
```
### 3.3.2 XGBoost: Non-Linear Drivers (SHAP Analysis)
XGBoost is a "black box" model, so we use SHAP (SHapley Additive exPlanations) to understand its predictions. The summary plot shows the most important features and the direction of their impact.
```{python}
#| label: xgboost-shap-inference
#| code-summary: "Model interpretability: SHAP (or permutation importance fallback)"
print("⚙️ Calculating feature attributions...")
# Helper to ensure we have a model to interpret
_tree_model = None
if 'best_tree_model' in globals():
_tree_model = best_tree_model
elif 'xgb_model' in globals():
_tree_model = xgb_model
if _tree_model is None:
print("⏭️ No tree model available to interpret.")
else:
# Try SHAP first
_has_shap = 'shap' in globals()
if _has_shap:
try:
# Case A: XGBoost model → use TreeExplainer
if 'xgb' in globals() and hasattr(xgb, 'XGBRegressor') and isinstance(_tree_model, xgb.XGBRegressor):
print("🧭 Using SHAP TreeExplainer for XGBoost...")
explainer = shap.TreeExplainer(_tree_model)
shap_values = explainer.shap_values(X_test_tree)
# Bar plot (mean |SHAP|)
shap.summary_plot(
shap_values,
X_test_tree,
plot_type="bar",
show=False
)
plt.title('Feature Importance (SHAP, XGBoost)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# Beeswarm
shap.summary_plot(
shap_values,
X_test_tree,
show=False
)
plt.title('SHAP Summary (XGBoost)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
else:
# Case B: scikit-learn tree model (e.g., HistGradientBoosting)
print("🧭 Using SHAP generic Explainer for scikit-learn tree model...")
explainer = shap.Explainer(_tree_model, X_train_tree)
shap_values = explainer(X_test_tree)
# Bar plot (mean |SHAP|)
shap.plots.bar(shap_values, max_display=20, show=False)
plt.title('Feature Importance (SHAP, TunedTree)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# Beeswarm
shap.plots.beeswarm(shap_values, max_display=20, show=False)
plt.title('SHAP Summary (TunedTree)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
except Exception as e:
print(f"⚠️ SHAP failed: {e}")
_has_shap = False
if not _has_shap:
# Fallback: permutation importance
print("⏭️ SHAP unavailable → using permutation importance fallback...")
from sklearn.inspection import permutation_importance
r = permutation_importance(_tree_model, X_test_tree, y_test_tree, n_repeats=20, random_state=42, n_jobs=-1)
importances = pd.Series(r.importances_mean, index=X_test_tree.columns).sort_values(ascending=True)
fig, ax = plt.subplots(figsize=(8, 8))
ax.barh(importances.index[-25:], importances.values[-25:], color='#00D9FF', alpha=0.85)
ax.set_title('Permutation Importance (Top 25 Features)', fontsize=16, fontweight='bold')
ax.set_xlabel('Mean Importance (Δ RMSE)', fontsize=12)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
        print("💡 Interpretation: higher bars = features whose shuffling degrades accuracy most (more important). Permutation importance does not indicate the direction of an effect; use partial dependence plots if direction is needed.")
```
To showcase the model's explanatory power on a granular level, we can generate SHAP waterfall plots for specific days in the test set. Let's analyze the days with the **lowest** and **highest** HRV to see what drove those outcomes.
```{python}
#| label: shap-waterfall-plots
#| code-summary: "Explain individual predictions with SHAP waterfall plots"
# Only run if we have SHAP and a tree model
if 'explainer' in globals() and 'shap_values' in globals():
display(Markdown("### Deconstructing a Low HRV Day"))
# Find the day with the lowest HRV in the test set
worst_day_idx = y_test_tree.idxmin()
# For XGBoost models, shap_values is a numpy array
if isinstance(shap_values, np.ndarray):
test_idx_position = X_test_tree.index.get_loc(worst_day_idx)
worst_day_shap_values = shap_values[test_idx_position]
print(f"Analyzing the lowest HRV day: {test_tree.loc[worst_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[worst_day_idx]:.1f} ms)")
shap.waterfall_plot(shap.Explanation(
values=worst_day_shap_values,
base_values=explainer.expected_value,
data=X_test_tree.loc[worst_day_idx].values,
feature_names=X_test_tree.columns.tolist()
))
else:
# For sklearn models, shap_values is an Explanation object
worst_day_shap = shap_values[X_test_tree.index.get_loc(worst_day_idx)]
print(f"Analyzing the lowest HRV day: {test_tree.loc[worst_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[worst_day_idx]:.1f} ms)")
shap.plots.waterfall(worst_day_shap, max_display=15, show=False)
plt.tight_layout()
plt.show()
display(Markdown("### Deconstructing a High HRV Day"))
# Find the day with the highest HRV in the test set
best_day_idx = y_test_tree.idxmax()
if isinstance(shap_values, np.ndarray):
test_idx_position = X_test_tree.index.get_loc(best_day_idx)
best_day_shap_values = shap_values[test_idx_position]
print(f"Analyzing the highest HRV day: {test_tree.loc[best_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[best_day_idx]:.1f} ms)")
shap.waterfall_plot(shap.Explanation(
values=best_day_shap_values,
base_values=explainer.expected_value,
data=X_test_tree.loc[best_day_idx].values,
feature_names=X_test_tree.columns.tolist()
))
else:
best_day_shap = shap_values[X_test_tree.index.get_loc(best_day_idx)]
print(f"Analyzing the highest HRV day: {test_tree.loc[best_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[best_day_idx]:.1f} ms)")
shap.plots.waterfall(best_day_shap, max_display=15, show=False)
plt.tight_layout()
plt.show()
else:
print("⏭️ SHAP waterfall plots require SHAP values to be calculated first.")
```
---
# Phase 4: Synthesis & Actionable Protocol
This final phase synthesizes the findings from both the Lasso and tree-based models and translates them into a practical, data-driven personal protocol to optimize recovery and performance.
## 4.1 Synthesis of Models: What Did We Learn?
By comparing our simple, interpretable Lasso model with the powerful, non-linear tree-based model, we can build a highly confident understanding of the key HRV drivers.
- **Strong Agreement**: Both models overwhelmingly agree that **`resting_heart_rate_bpm`** and its trend **`rhr_deviation_from_7d_mean`** are the most dominant factors. This confirms that your baseline physiological state is the most critical input for next-day HRV.
- **Where Tree Models Shine**: The non-linear tree model uncovered a crucial insight that Lasso missed: the high importance of **`cumulative_workout_minutes`**. It correctly identified this not as a source of fatigue, but as a proxy for **fitness**, where a higher chronic training load builds a more resilient system and a higher HRV ceiling.
- **The Power of Composite Features**: Engineered features like the **`strain_recovery_balance_lag1`** score and **`sleep_quality_score`** were consistently ranked as important by the tree model, showing that combining related metrics into a single, physiologically aware feature provides a strong signal.
## 4.2 Actionable Insights & Personal Protocol
The model's findings can be distilled into a simple, actionable protocol. This moves the project from descriptive analysis to a prescriptive tool for self-improvement.
**1. The RHR Deviation Rule (The "Red Flag" Detector):**
- **Finding:** The feature `rhr_deviation_from_7d_mean` is a powerful negative driver, signaling acute stress.
- **Protocol:** If my morning RHR is **2+ bpm above** my 7-day average, I will treat it as a signal of systemic fatigue or impending illness. That day's training will be shifted to active recovery (e.g., walking, mobility), regardless of the original plan.
**2. The Chronic Load Principle (The "Fitness Builder"):**
- **Finding:** `cumulative_workout_minutes` is a major positive driver of baseline HRV.
- **Protocol:** The key to a higher HRV ceiling is not single heroic efforts, but **consistency**. The focus should be on maintaining a steady weekly training volume (`total_workout_minutes_rolling7d_sum`) to continuously build the fitness base.
**3. The Physiological Bank Account (The "Daily Balance"):**
- **Finding:** The `strain_recovery_balance_lag1` score was a clean and effective predictor in the tree model.
- **Protocol:** After any day with a high-intensity workout (e.g., `workout_intensity_exp` in the top 25th percentile), I will make a conscious effort to maximize the "deposit" by prioritizing sleep duration and quality to offset the "withdrawal."
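The "Red Flag" rule in item 1 can be sketched as a simple check against a trailing 7-day baseline. This is a minimal sketch: the `rhr_red_flag` helper and the sample values are illustrative, and the 2 bpm threshold is the one proposed above.

```{python}
#| label: rhr-red-flag-sketch
#| code-summary: "Sketch: flag mornings where RHR is 2+ bpm above its 7-day baseline"
import pandas as pd

def rhr_red_flag(rhr: pd.Series, threshold_bpm: float = 2.0) -> pd.Series:
    # Shift the rolling mean by one day so each morning is compared only
    # against its trailing baseline, never against itself
    baseline = rhr.rolling(7, min_periods=3).mean().shift(1)
    return (rhr - baseline) >= threshold_bpm

# Illustrative week of morning RHR values (bpm); the last day spikes
rhr = pd.Series([52, 53, 51, 52, 53, 52, 56], name="resting_heart_rate_bpm")
print(rhr_red_flag(rhr).tolist())  # → only the final day is flagged
```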
By adhering to this data-driven protocol, the insights from this analysis can be directly applied to optimize daily well-being and long-term athletic adaptation.
---
*Last updated: `{python} pd.Timestamp.now().strftime("%Y-%m-%d %H:%M")`*