---
title: "Understanding HRV: What Drives Heart Rate Variability?"
author: "Camilo Martinez"
date: today
format:
  html:
    code-fold: true
    code-tools: true
    toc: true
    toc-depth: 3
    theme: darkly
    embed-resources: true
execute:
  warning: false
  message: false
jupyter: python3
---
# Project Overview
## Research Question
**What aspects of my daily routine lead to high or low HRV (Heart Rate Variability)?**
This project uses **inference** rather than pure prediction to understand the underlying mechanisms that drive HRV variability. Our goal is to identify which lifestyle factors, sleep metrics, and workout characteristics have the strongest causal relationships with next-day HRV.
## Data Sources
All data comes from the **WHOOP API**, aggregated into a daily feature table with one row per day. Each row represents:
- **Target Variable**: `hrv_rmssd_ms` (HRV in milliseconds)
- **Recovery Metrics**: Resting heart rate, SpO2, skin temperature
- **Sleep Metrics**: Sleep stages, efficiency, disturbances, respiratory rate
- **Workout Metrics**: Total workout time, heart rate zones, average workout intensity
## Methodology
This analysis follows a rigorous data science workflow:
1. **Phase 1: Deep Exploratory Data Analysis (EDA)**
- Visualize target variable over time
- Analyze distributions and detect outliers
- Perform bivariate analysis
2. **Phase 2: Feature Engineering & Selection**
3. **Phase 3: Statistical Modeling & Inference**
4. **Phase 4: Validation & Interpretation**
## Executive Summary: Key Findings
This analysis successfully built a machine learning model capable of explaining a significant portion of daily HRV variance. The key takeaway is that **HRV is less about single-day actions and more about the body's underlying physiological state and accumulated load.**
The three most powerful drivers identified are:
1. **Resting Heart Rate (RHR):** A low RHR is the single strongest predictor of a high HRV, indicating a state of recovery and readiness.
2. **Acute Stress (RHR Deviation):** A sudden spike in RHR compared to its 7-day average is a powerful negative driver, signaling acute stress, illness, or poor recovery.
3. **Chronic Fitness Base:** Long-term cumulative workout minutes positively correlate with a higher baseline HRV, demonstrating that a consistent training load builds a more resilient system.
These insights form the basis for a personal, data-driven protocol to optimize daily performance and recovery.
---
# Setup & Data Loading
```{python}
#| label: setup
#| code-summary: "Import libraries and configure environment"
# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Database connection
import psycopg2
from sqlalchemy import create_engine
import os
from dotenv import load_dotenv
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Statistical analysis
from scipy import stats
from scipy.stats import pearsonr, spearmanr
# Configure plotting style
plt.style.use('dark_background')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (8, 4.5)
plt.rcParams['font.size'] = 9
# Load environment variables
load_dotenv()
print("✅ Libraries loaded successfully")
```
We proceed to load the data from a PostgreSQL database. The SQL query aggregates daily features from multiple tables, producing one row per day with the relevant metrics.
```{python}
#| label: load-data
#| code-summary: "Load HRV feature data from PostgreSQL"
# Database connection
DATABASE_URL = os.getenv('DATABASE_URL')
# Fix for SQLAlchemy: Replace 'postgres://' with 'postgresql://'
if DATABASE_URL and DATABASE_URL.startswith('postgres://'):
    DATABASE_URL = DATABASE_URL.replace('postgres://', 'postgresql://', 1)
# Create SQLAlchemy engine
engine = create_engine(DATABASE_URL)
# SQL query to fetch daily aggregated features
query = """
-- CTE 1: Pre-aggregates daily workout data
WITH workouts_daily_summary AS (
SELECT
DATE(start_time) AS workout_date,
SUM(EXTRACT(EPOCH FROM (end_time - start_time)) / 60) AS total_workout_minutes,
AVG(avg_heart_rate_bpm) AS avg_workout_heart_rate_bpm,
SUM(hr_zone_0_ms) AS total_hr_zone_0_ms,
SUM(hr_zone_1_ms) AS total_hr_zone_1_ms,
SUM(hr_zone_2_ms) AS total_hr_zone_2_ms,
SUM(hr_zone_3_ms) AS total_hr_zone_3_ms,
SUM(hr_zone_4_ms) AS total_hr_zone_4_ms,
SUM(hr_zone_5_ms) AS total_hr_zone_5_ms
FROM
whoop_workouts
WHERE
sport_name NOT IN ('sauna', 'meditation')
GROUP BY
1
),
-- CTE 2: Aggregates ALL cycles for a given day (sleep + naps)
cycles_daily_aggregates AS (
SELECT
DATE(start_time) AS cycle_date,
-- Duration-weighted average heart rate for the entire day
SUM(avg_heart_rate_bpm * EXTRACT(EPOCH FROM (end_time - start_time))) /
SUM(EXTRACT(EPOCH FROM (end_time - start_time))) AS cycle_avg_hr_bpm,
-- Maximum heart rate across all cycles for that day
MAX(max_heart_rate_bpm) AS cycle_max_hr_bpm
FROM
whoop_cycles
GROUP BY
1
),
-- CTE 3: Identifies the primary (longest) sleep cycle for each day
main_sleep_cycle AS (
SELECT
id,
start_time,
end_time,
-- Rank cycles by duration - longest sleep is rank 1
ROW_NUMBER() OVER(
PARTITION BY DATE(start_time)
ORDER BY (end_time - start_time) DESC
) as sleep_rank
FROM
whoop_cycles
)
-- Final query: One row per day with all features
SELECT
DATE(main.start_time) AS cycle_date,
-- TARGET VARIABLE
r.hrv_rmssd_ms,
-- Recovery Features
r.resting_heart_rate_bpm,
r.spo2_percentage,
r.skin_temp_celsius,
-- Cycle Features (accounts for naps)
c_agg.cycle_avg_hr_bpm,
c_agg.cycle_max_hr_bpm,
-- Sleep Features
s.respiratory_rate,
s.sleep_consistency_percentage,
s.sleep_efficiency_percentage,
s.total_in_bed_time_ms,
s.total_awake_time_ms,
s.total_light_sleep_time_ms,
s.total_slow_wave_sleep_time_ms,
s.total_rem_sleep_time_ms,
s.disturbance_count,
-- Workout Features (aggregated daily)
COALESCE(wds.total_workout_minutes, 0) AS total_workout_minutes,
COALESCE(wds.avg_workout_heart_rate_bpm, 0) AS avg_workout_hr_bpm,
COALESCE(wds.total_hr_zone_0_ms, 0) AS total_hr_zone_0_ms,
COALESCE(wds.total_hr_zone_1_ms, 0) AS total_hr_zone_1_ms,
COALESCE(wds.total_hr_zone_2_ms, 0) AS total_hr_zone_2_ms,
COALESCE(wds.total_hr_zone_3_ms, 0) AS total_hr_zone_3_ms,
COALESCE(wds.total_hr_zone_4_ms, 0) AS total_hr_zone_4_ms,
COALESCE(wds.total_hr_zone_5_ms, 0) AS total_hr_zone_5_ms
FROM
main_sleep_cycle AS main
LEFT JOIN
cycles_daily_aggregates AS c_agg ON DATE(main.start_time) = c_agg.cycle_date
LEFT JOIN
whoop_recovery AS r ON main.id = r.cycle_id
LEFT JOIN
whoop_sleep AS s ON main.id = s.cycle_id
LEFT JOIN
workouts_daily_summary AS wds ON DATE(main.start_time) = wds.workout_date
WHERE
main.sleep_rank = 1 -- One row per day
ORDER BY
cycle_date DESC;
"""
# Load data
df = pd.read_sql_query(query, engine)
# Ensure datetimelike dtype for downstream .dt usage
df['cycle_date'] = pd.to_datetime(df['cycle_date'], errors='coerce')
print(f"✅ Data loaded successfully")
print(f"📊 Dataset shape: {df.shape[0]} days × {df.shape[1]} features")
print(f"📅 Date range: {df['cycle_date'].min()} to {df['cycle_date'].max()}")
```
```{python}
#| label: data-overview
#| code-summary: "Display dataset structure and summary statistics"
# Display first few rows
print("\n" + "="*80)
print("DATASET PREVIEW")
print("="*80)
display(df.head(10))
# Display data types
print("\n" + "="*80)
print("DATA TYPES")
print("="*80)
print(df.dtypes)
# Basic statistics
print("\n" + "="*80)
print("SUMMARY STATISTICS")
print("="*80)
display(df.describe())
# Missing values
print("\n" + "="*80)
print("MISSING VALUES")
print("="*80)
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
'Missing Count': missing,
'Percentage': missing_pct
}).sort_values('Missing Count', ascending=False)
print(missing_df[missing_df['Missing Count'] > 0])
```
---
# Phase 1: Deep Exploratory Data Analysis
## 1.1 Target Variable Analysis: HRV Over Time
**Goal**: Understand the baseline behavior of HRV. Look for trends, seasonality, and anomalies.
```{python}
#| label: hrv-timeseries
#| code-summary: "Visualize HRV trends over time"
#| fig-cap: "HRV (RMSSD) time series showing daily variation and trends"
# Create interactive time series plot
fig = go.Figure()
# Add main HRV line
fig.add_trace(go.Scatter(
x=df['cycle_date'],
y=df['hrv_rmssd_ms'],
mode='lines+markers',
name='HRV (RMSSD)',
line=dict(color='#00D9FF', width=2),
marker=dict(size=4),
hovertemplate='<b>Date</b>: %{x}<br>' +
'<b>HRV</b>: %{y:.1f} ms<br>' +
'<extra></extra>'
))
# Add 7-day moving average
df_sorted = df.sort_values('cycle_date')
df_sorted['hrv_ma7'] = df_sorted['hrv_rmssd_ms'].rolling(window=7, center=True).mean()
fig.add_trace(go.Scatter(
x=df_sorted['cycle_date'],
y=df_sorted['hrv_ma7'],
mode='lines',
name='7-Day Moving Average',
line=dict(color='#FF6B6B', width=3, dash='dash'),
hovertemplate='<b>7-Day Avg</b>: %{y:.1f} ms<br>' +
'<extra></extra>'
))
# Add median reference line
median_hrv = df['hrv_rmssd_ms'].median()
fig.add_hline(
y=median_hrv,
line_dash="dot",
line_color="yellow",
annotation_text=f"Median: {median_hrv:.1f} ms",
annotation_position="right"
)
fig.update_layout(
title="HRV (RMSSD) Over Time: Daily Variation & Trends",
xaxis_title="Date",
yaxis_title="HRV (ms)",
template="plotly_dark",
hovermode='x unified',
height=500
)
fig.show()
# Print key statistics
print(f"\n📈 HRV STATISTICS")
print(f"{'='*50}")
print(f"Mean HRV: {df['hrv_rmssd_ms'].mean():.2f} ms")
print(f"Median HRV: {df['hrv_rmssd_ms'].median():.2f} ms")
print(f"Std Dev: {df['hrv_rmssd_ms'].std():.2f} ms")
print(f"Min HRV: {df['hrv_rmssd_ms'].min():.2f} ms (Date: {df.loc[df['hrv_rmssd_ms'].idxmin(), 'cycle_date']})")
print(f"Max HRV: {df['hrv_rmssd_ms'].max():.2f} ms (Date: {df.loc[df['hrv_rmssd_ms'].idxmax(), 'cycle_date']})")
print(f"Range: {df['hrv_rmssd_ms'].max() - df['hrv_rmssd_ms'].min():.2f} ms")
print(f"CV: {(df['hrv_rmssd_ms'].std() / df['hrv_rmssd_ms'].mean() * 100):.2f}%")
```
My HRV data shows steady resilience with an improving recovery trend:

- A strong, stable baseline around 66 ms indicates consistent recovery.
- The gradual rise from August to October signals improving fitness or recovery habits (better sleep, balanced training).
- The June dip likely reflects fatigue or stress, periods where recovery lagged behind strain.
- The October peak suggests optimal recovery and readiness.

The coefficient of variation (standard deviation / mean) is 16.19%, a healthy level of variability that indicates good autonomic nervous system function.
```{python}
#| label: hrv-distribution
#| code-summary: "Analyze HRV distribution"
#| fig-cap: "HRV distribution with normality test"
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Histogram with KDE
axes[0].hist(df['hrv_rmssd_ms'], bins=30, alpha=0.7, color='#00D9FF', edgecolor='white')
axes[0].axvline(df['hrv_rmssd_ms'].mean(), color='#FF6B6B', linestyle='--', linewidth=2, label=f'Mean: {df["hrv_rmssd_ms"].mean():.1f}')
axes[0].axvline(df['hrv_rmssd_ms'].median(), color='yellow', linestyle='--', linewidth=2, label=f'Median: {df["hrv_rmssd_ms"].median():.1f}')
axes[0].set_xlabel('HRV (ms)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('HRV Distribution', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Q-Q plot for normality check
stats.probplot(df['hrv_rmssd_ms'].dropna(), dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot: Testing Normality', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Normality test
statistic, p_value = stats.shapiro(df['hrv_rmssd_ms'].dropna())
print(f"\n🔬 SHAPIRO-WILK NORMALITY TEST")
print(f"{'='*50}")
print(f"Test Statistic: {statistic:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value > 0.05:
    print("✅ HRV appears normally distributed (p > 0.05)")
else:
    print("⚠️ HRV may not be normally distributed (p ≤ 0.05)")
```
- **Not normal**: Shapiro-Wilk p = 0.0348, so we reject normality at the 5% level.
- **Shape**: Nearly centered (mean 66.7 ≈ median 65.9) but with non-normal tails (the Q-Q plot deviates at both ends), suggesting mild right skew or heavy tails.
- **Implication**: Prefer robust statistics and tests: median/IQR, Spearman (ρ) over Pearson when relating features, non-parametric tests, or bootstrap CIs.
- **For modeling**: Tree models are fine. For linear models, consider transforming HRV or the predictors (e.g., Box-Cox/Yeo-Johnson) and check the residuals.
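To make the robust alternatives above concrete, the sketch below applies Spearman correlation and a Yeo-Johnson transform to synthetic right-skewed data. The data here is simulated (a gamma-distributed stand-in for HRV and a loosely related predictor), not the WHOOP table:

```{python}
#| label: robust-stats-sketch
#| code-summary: "Sketch: Spearman correlation and Yeo-Johnson on skewed synthetic data"
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Right-skewed stand-in for HRV and a loosely related predictor
hrv_sim = rng.gamma(shape=9.0, scale=7.5, size=150)
rhr_sim = 60 - 0.1 * hrv_sim + rng.normal(0, 2, size=150)

# Spearman's rho is rank-based: no normality assumption needed
rho, p = stats.spearmanr(rhr_sim, hrv_sim)
print(f"Spearman rho = {rho:.3f} (p = {p:.4g})")

# Yeo-Johnson pulls the distribution toward symmetry before linear modeling
pt = PowerTransformer(method='yeo-johnson')
hrv_t = pt.fit_transform(hrv_sim.reshape(-1, 1)).ravel()
print(f"Skew before: {stats.skew(hrv_sim):.2f}, after: {stats.skew(hrv_t):.2f}")
```

The same pattern would apply to the real `hrv_rmssd_ms` column: correlate with `spearmanr`, and fit the transformer on training data only.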
## 1.2 Feature Distributions & Outlier Detection
**Goal**: Identify data quality issues, outliers, and potential errors before modeling.
```{python}
#| label: feature-distributions
#| code-summary: "Analyze distributions of all features"
#| fig-cap: "Distribution plots for all numeric features"
# Select numeric columns (exclude date and target)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove('hrv_rmssd_ms') # We'll handle target separately
# Create subplots
n_cols = 4
n_rows = int(np.ceil(len(numeric_cols) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, n_rows * 3))
axes = axes.flatten()
for idx, col in enumerate(numeric_cols):
    ax = axes[idx]
    data = df[col].dropna()
    # Histogram
    ax.hist(data, bins=30, alpha=0.7, color='#00D9FF', edgecolor='white')
    ax.axvline(data.mean(), color='#FF6B6B', linestyle='--', linewidth=2, label=f'Mean: {data.mean():.1f}')
    ax.axvline(data.median(), color='yellow', linestyle='--', linewidth=2, label=f'Median: {data.median():.1f}')
    ax.set_title(col, fontsize=10, fontweight='bold')
    ax.set_xlabel('Value', fontsize=8)
    ax.set_ylabel('Frequency', fontsize=8)
    ax.legend(fontsize=7)
    ax.grid(alpha=0.3)
# Hide unused subplots
for idx in range(len(numeric_cols), len(axes)):
    axes[idx].axis('off')
plt.tight_layout()
plt.show()
```
```{python}
#| label: outlier-detection
#| code-summary: "Detect outliers using IQR method"
print("\n🔍 OUTLIER DETECTION (IQR Method)")
print("="*80)
outlier_summary = []
for col in numeric_cols:
    data = df[col].dropna()
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    if len(outliers) > 0:
        outlier_summary.append({
            'Feature': col,
            'Outlier Count': len(outliers),
            'Percentage': f"{len(outliers) / len(data) * 100:.2f}%",
            'Lower Bound': f"{lower_bound:.2f}",
            'Upper Bound': f"{upper_bound:.2f}",
            'Min Outlier': f"{outliers.min():.2f}",
            'Max Outlier': f"{outliers.max():.2f}"
        })
if outlier_summary:
    outlier_df = pd.DataFrame(outlier_summary)
    display(outlier_df)
else:
    print("✅ No significant outliers detected in any feature")
```
```{python}
#| label: boxplots
#| code-summary: "Boxplots for visual outlier inspection"
#| fig-cap: "Boxplots showing outliers across all features"
# Normalize data for visualization (z-score)
df_normalized = df[numeric_cols].apply(lambda x: (x - x.mean()) / x.std())
fig, ax = plt.subplots(figsize=(10, 5))
bp = ax.boxplot([df_normalized[col].dropna() for col in numeric_cols],
labels=numeric_cols,
patch_artist=True,
showfliers=True)
# Customize boxplot colors
for patch in bp['boxes']:
    patch.set_facecolor('#00D9FF')
    patch.set_alpha(0.7)
for whisker in bp['whiskers']:
    whisker.set(color='white', linewidth=1.5)
for cap in bp['caps']:
    cap.set(color='white', linewidth=1.5)
for median in bp['medians']:
    median.set(color='#FF6B6B', linewidth=2)
for flier in bp['fliers']:
    flier.set(marker='o', color='red', alpha=0.5)
ax.set_xlabel('Features (Normalized)', fontsize=12, fontweight='bold')
ax.set_ylabel('Z-Score', fontsize=12, fontweight='bold')
ax.set_title('Outlier Detection: Boxplots (Normalized)', fontsize=14, fontweight='bold')
ax.axhline(y=0, color='yellow', linestyle='--', alpha=0.5, label='Mean')
ax.grid(axis='y', alpha=0.3)
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()
```
## 1.3 Bivariate Analysis: HRV Relationships
**Goal**: Understand which features have the strongest relationships with HRV.
```{python}
#| label: correlation-heatmap
#| code-summary: "Correlation matrix with HRV"
#| fig-cap: "Correlation heatmap showing relationships between features and HRV"
# Calculate correlations with HRV
correlations = df[numeric_cols + ['hrv_rmssd_ms']].corr()['hrv_rmssd_ms'].sort_values(ascending=False)
# Create horizontal bar plot
fig, ax = plt.subplots(figsize=(8, 8))
colors = ['#FF6B6B' if x < 0 else '#00D9FF' for x in correlations.values[1:]] # Skip HRV itself
bars = ax.barh(correlations.index[1:], correlations.values[1:], color=colors, alpha=0.8)
ax.axvline(x=0, color='white', linestyle='-', linewidth=1)
ax.set_xlabel('Correlation with HRV', fontsize=12, fontweight='bold')
ax.set_title('Feature Correlations with HRV (RMSSD)', fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)
# Add value labels
for i, (bar, val) in enumerate(zip(bars, correlations.values[1:])):
    ax.text(val + 0.01 if val > 0 else val - 0.01, i, f'{val:.3f}',
            ha='left' if val > 0 else 'right', va='center', fontsize=9, fontweight='bold')
plt.tight_layout()
plt.show()
print("\n🔗 TOP 10 STRONGEST CORRELATIONS WITH HRV")
print("="*50)
# Convert to list to make it subscriptable
corr_items = list(correlations.items())
for feature, corr in corr_items[1:11]:  # Top 10, skip HRV itself
    print(f"{feature:40s}: {corr:>7.3f}")
```
```{python}
#| label: scatter-plots
#| code-summary: "Scatter plots for key features vs HRV"
#| fig-cap: "Bivariate relationships: Key features vs HRV"
# Select top 6 correlated features (absolute value)
top_features = correlations.abs().sort_values(ascending=False).index[1:7].tolist()
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
axes = axes.flatten()
for idx, feature in enumerate(top_features):
    ax = axes[idx]
    # Scatter plot
    ax.scatter(df[feature], df['hrv_rmssd_ms'], alpha=0.6, s=30, color='#00D9FF')
    # Add regression line (fit only on rows where both values are present)
    mask = df[[feature, 'hrv_rmssd_ms']].notna().all(axis=1)
    if mask.sum() > 1:
        z = np.polyfit(df.loc[mask, feature], df.loc[mask, 'hrv_rmssd_ms'], 1)
        p = np.poly1d(z)
        ax.plot(df.loc[mask, feature].sort_values(),
                p(df.loc[mask, feature].sort_values()),
                "r--", linewidth=2, alpha=0.8, label='Linear Fit')
    # Correlate on aligned pairs: calling .dropna() per column separately would misalign rows
    corr, p_val = pearsonr(df.loc[mask, feature], df.loc[mask, 'hrv_rmssd_ms'])
    ax.set_xlabel(feature, fontsize=10, fontweight='bold')
    ax.set_ylabel('HRV (ms)', fontsize=10, fontweight='bold')
    ax.set_title(f'{feature}\nCorr: {corr:.3f} (p={p_val:.4f})', fontsize=11, fontweight='bold')
    ax.grid(alpha=0.3)
    ax.legend()
plt.tight_layout()
plt.show()
```
```{python}
#| label: sleep-stages-analysis
#| code-summary: "Analyze relationship between sleep stages and HRV"
#| fig-cap: "Sleep stage impact on HRV"
# Convert sleep times from ms to hours
sleep_cols = ['total_light_sleep_time_ms', 'total_slow_wave_sleep_time_ms',
'total_rem_sleep_time_ms', 'total_awake_time_ms']
df_sleep = df.copy()
for col in sleep_cols:
    df_sleep[col.replace('_ms', '_hrs')] = df_sleep[col] / (1000 * 60 * 60)
# Create grouped bar plot with improved labels
sleep_stage_corrs = []
# Use cleaner, more readable labels
label_mapping = {
'total_light_sleep_time_ms': 'Light Sleep',
'total_slow_wave_sleep_time_ms': 'Deep Sleep (SWS)',
'total_rem_sleep_time_ms': 'REM Sleep',
'total_awake_time_ms': 'Awake Time'
}
for col in sleep_cols:
    col_hrs = col.replace('_ms', '_hrs')
    # Align rows before correlating: separate dropna() calls would shift the pairs
    pair = df_sleep[[col_hrs, 'hrv_rmssd_ms']].dropna()
    corr, _ = pearsonr(pair[col_hrs], pair['hrv_rmssd_ms'])
    sleep_stage_corrs.append({
        'Sleep Stage': label_mapping[col],
        'Correlation': corr
    })
sleep_corr_df = pd.DataFrame(sleep_stage_corrs)
# Create figure with more padding
fig, ax = plt.subplots(figsize=(10, 5.5))
bars = ax.bar(sleep_corr_df['Sleep Stage'], sleep_corr_df['Correlation'],
color=['#00D9FF', '#FF6B6B', '#FFD93D', '#6BCB77'],
alpha=0.8, width=0.6, edgecolor='white', linewidth=1.5)
ax.axhline(y=0, color='white', linestyle='-', linewidth=1.5, alpha=0.5)
ax.set_ylabel('Correlation with HRV', fontsize=14, fontweight='bold', labelpad=15)
ax.set_xlabel('Sleep Stage', fontsize=14, fontweight='bold', labelpad=15)
ax.set_title('Sleep Stage Correlations with HRV', fontsize=16, fontweight='bold', pad=20)
ax.grid(axis='y', alpha=0.3, linestyle='--')
# Improved value labels - positioned directly on top of bars
for bar, val in zip(bars, sleep_corr_df['Correlation']):
    height = bar.get_height()
    # Position the label just above the top of each bar
    ax.text(bar.get_x() + bar.get_width() / 2.,
            height,
            f'{val:.3f}',
            ha='center',
            va='bottom',  # bottom of the text sits on top of the bar
            fontsize=13,
            fontweight='bold',
            color='white',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='black', alpha=0.8, edgecolor='white', linewidth=1))
# Improved x-axis labels - no rotation needed with cleaner names
ax.tick_params(axis='both', which='major', labelsize=12, length=6, width=1.5)
plt.xticks(rotation=0) # Horizontal labels for better readability
# Add more padding around the plot
plt.tight_layout(pad=2.0) # Increased padding
plt.subplots_adjust(left=0.12, right=0.95, top=0.92, bottom=0.12)
plt.show()
```
- REM and light sleep matter most for next-morning HRV: r ≈ 0.27 each, or about 7% of HRV variance individually (0.27²).
- Slow-wave (deep) sleep shows only a weak link (r ≈ 0.09, under 1% of variance).
- Awake time is essentially uninformative (r ≈ 0.006).
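The variance-explained figures above are simply the squared correlations. A quick check, using the rounded r values read off the chart rather than recomputed from the data:

```{python}
#| label: sleep-r2-sketch
#| code-summary: "Sketch: converting correlation r into shared variance r²"
# Rounded correlations from the chart above (approximate values)
stage_r = {'REM': 0.27, 'Light': 0.27, 'Deep (SWS)': 0.09, 'Awake': 0.006}
for stage, r in stage_r.items():
    # r² is the fraction of HRV variance linearly shared with the sleep stage
    print(f"{stage:>11s}: r = {r:+.3f} -> r² = {r**2:.1%} of HRV variance")
```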
# Key Insights from EDA
```{python}
#| label: eda-summary
#| code-summary: "Generate summary of key findings"
print("\n" + "="*80)
print("📊 EXPLORATORY DATA ANALYSIS SUMMARY")
print("="*80)
print(f"\n1️⃣ DATASET OVERVIEW")
print(f" • Total observations: {len(df)} days")
print(f" • Date range: {(df['cycle_date'].max() - df['cycle_date'].min()).days} days")
print(f" • Missing data: {df.isnull().sum().sum()} total missing values")
print(f"\n2️⃣ HRV CHARACTERISTICS")
print(f" • Mean HRV: {df['hrv_rmssd_ms'].mean():.1f} ms")
print(f" • Coefficient of Variation: {(df['hrv_rmssd_ms'].std() / df['hrv_rmssd_ms'].mean() * 100):.1f}%")
print(f" • Distribution: {'Normal' if stats.shapiro(df['hrv_rmssd_ms'].dropna())[1] > 0.05 else 'Non-normal'}")
print(f"\n3️⃣ TOP POSITIVE CORRELATIONS (Higher → Higher HRV)")
top_pos = correlations[correlations > 0].sort_values(ascending=False)[1:4]
for feature, corr in top_pos.items():
    print(f" • {feature}: {corr:.3f}")
print(f"\n4️⃣ TOP NEGATIVE CORRELATIONS (Higher → Lower HRV)")
top_neg = correlations[correlations < 0].sort_values()[0:3]
for feature, corr in top_neg.items():
    print(f" • {feature}: {corr:.3f}")
print(f"\n5️⃣ DATA QUALITY")
outlier_count = sum([len(df[col][(df[col] < df[col].quantile(0.25) - 1.5 * (df[col].quantile(0.75) - df[col].quantile(0.25))) |
(df[col] > df[col].quantile(0.75) + 1.5 * (df[col].quantile(0.75) - df[col].quantile(0.25)))])
for col in numeric_cols])
print(f" • Total outliers detected: {outlier_count}")
print(f" • Features with >10% missing: {len(missing_df[missing_df['Percentage'] > 10])}")
print("\n" + "="*80)
print("✅ Phase 1 Complete: Ready for Feature Engineering & Modeling")
print("="*80)
```
---
# Phase 2: Data Preprocessing & Feature Engineering
## 2.1 Missing Data Imputation
**Critical First Step**: Handle missing values BEFORE creating any derived features. Lagged and rolling window calculations will fail or produce incorrect results with NULL values.
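A toy series illustrates the point (assumed values, not rows from the actual table): a single NULL propagates into the lagged column and poisons every rolling window that touches it.

```{python}
#| label: nan-lag-demo
#| code-summary: "Sketch: why NULLs must be imputed before lagging/rolling"
import numpy as np
import pandas as pd

s = pd.Series([50.0, np.nan, 60.0, 70.0])

lagged_raw = s.shift(1)                         # the NaN carries into the lag
rolled_raw = s.rolling(3).mean()                # any window containing the NaN is NaN
lagged_imputed = s.fillna(s.median()).shift(1)  # impute first, then lag

print(lagged_raw.tolist())      # [nan, 50.0, nan, 60.0]
print(rolled_raw.tolist())      # all NaN here: every 3-value window touches the gap
print(lagged_imputed.tolist())  # [nan, 50.0, 60.0, 60.0]
```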
```{python}
#| label: missing-data-imputation
#| code-summary: "Impute missing values with appropriate strategies"
# Store original dataset for comparison
df_original = df.copy()
print("🔧 MISSING DATA IMPUTATION")
print("="*80)
# Check missing data before imputation
missing_before = df.isnull().sum()
print("\n📊 Missing values BEFORE imputation:")
print(missing_before[missing_before > 0])
# Imputation Strategy:
# - skin_temp_celsius: Use MEDIAN (robust to outliers)
# - Other numeric features: Use MEDIAN (safer than mean for skewed distributions)
numeric_cols_to_impute = df.select_dtypes(include=[np.number]).columns.tolist()
for col in numeric_cols_to_impute:
    if df[col].isnull().sum() > 0:
        median_value = df[col].median()
        missing_n = df[col].isna().sum()
        df[col] = df[col].fillna(median_value)
        print(f"✅ {col}: Filled {missing_n} missing values with median ({median_value:.2f})")
# Verify no missing values remain
missing_after = df.isnull().sum()
print(f"\n✅ Total missing values AFTER imputation: {missing_after.sum()}")
if missing_after.sum() == 0:
    print("🎉 All missing data successfully imputed!")
else:
    print("⚠️ Some missing values remain:")
    print(missing_after[missing_after > 0])
```
## 2.2 Feature Engineering (Time-Based Features)
**Important**: Engineer features from the ORIGINAL, untransformed data. This keeps features in human-interpretable units (minutes, bpm, etc.).
```{python}
#| label: feature-engineering
#| code-summary: "Create lagged features and rolling windows for time-series analysis"
print("\n🔨 FEATURE ENGINEERING")
print("="*80)
# Store original dataset for comparison BEFORE feature engineering
df_original = df.copy()
# Sort by date to ensure correct time-series order
df = df.sort_values('cycle_date').reset_index(drop=True)
# Ensure cycle_date is true datetime for .dt operations (defensive)
from pandas.api.types import is_datetime64_any_dtype
if not is_datetime64_any_dtype(df['cycle_date']):
    df['cycle_date'] = pd.to_datetime(df['cycle_date'], errors='coerce')
# ============================================================================
# LAG FEATURES: Yesterday's values affecting today's HRV
# ============================================================================
lag_features = [
'total_workout_minutes',
'avg_workout_hr_bpm',
'resting_heart_rate_bpm',
'sleep_efficiency_percentage',
'total_rem_sleep_time_ms',
'total_slow_wave_sleep_time_ms'
]
print("\n📅 Creating LAG-1 Features (Previous Day):")
for feature in lag_features:
    lag_col_name = f"{feature}_lag1"
    df[lag_col_name] = df[feature].shift(1)
    print(f" ✅ {lag_col_name}")
# ============================================================================
# ROLLING WINDOW FEATURES: Multi-day trends and patterns
# ============================================================================
rolling_features = [
('total_workout_minutes', 7, 'mean'),
('total_workout_minutes', 7, 'sum'),
('resting_heart_rate_bpm', 7, 'mean'),
('sleep_efficiency_percentage', 7, 'mean'),
('hrv_rmssd_ms', 7, 'std'), # HRV variability over 7 days
]
print("\n📊 Creating ROLLING WINDOW Features:")
for feature, window, agg_func in rolling_features:
    col_name = f"{feature}_rolling{window}d_{agg_func}"
    if agg_func == 'mean':
        df[col_name] = df[feature].rolling(window=window, min_periods=1).mean()
    elif agg_func == 'sum':
        df[col_name] = df[feature].rolling(window=window, min_periods=1).sum()
    elif agg_func == 'std':
        df[col_name] = df[feature].rolling(window=window, min_periods=1).std()
    print(f" ✅ {col_name}")
# ============================================================================
# CUMULATIVE TRAINING LOAD: Running total of workout time
# ============================================================================
print("\n💪 Creating CUMULATIVE TRAINING LOAD:")
df['cumulative_workout_minutes'] = df['total_workout_minutes'].cumsum()
print(" ✅ cumulative_workout_minutes")
# ============================================================================
# INTERACTION FEATURES: Combined effects
# ============================================================================
print("\n🔗 Creating INTERACTION Features:")
# Sleep quality score (normalized composite)
df['sleep_quality_score'] = (
(df['sleep_efficiency_percentage'] / 100) * 0.4 +
(df['total_rem_sleep_time_ms'] / df['total_rem_sleep_time_ms'].max()) * 0.3 +
(df['total_slow_wave_sleep_time_ms'] / df['total_slow_wave_sleep_time_ms'].max()) * 0.3
)
print(" ✅ sleep_quality_score (composite)")
# Workout intensity index (HR zones weighted)
df['workout_intensity_index'] = (
df['total_hr_zone_0_ms'] * 0.5 +
df['total_hr_zone_1_ms'] * 1.0 +
df['total_hr_zone_2_ms'] * 1.5 +
df['total_hr_zone_3_ms'] * 2.0 +
df['total_hr_zone_4_ms'] * 2.5 +
df['total_hr_zone_5_ms'] * 3.0
) / 1000 # Convert ms to seconds
print(" ✅ workout_intensity_index (weighted HR zones)")
# ============================================================================
# ADVANCED PHYSIOLOGICAL FEATURES
# ============================================================================
print("\n🧠 Creating ADVANCED PHYSIOLOGICAL FEATURES:")
# 1. Exponentially Weighted Workout Intensity
print(" ➕ Exponentially Weighted Workout Intensity")
df['workout_intensity_exp'] = (
df['total_hr_zone_1_ms'] * (1.2**1) +
df['total_hr_zone_2_ms'] * (1.2**2) +
df['total_hr_zone_3_ms'] * (1.2**3) +
df['total_hr_zone_4_ms'] * (1.2**4) +
df['total_hr_zone_5_ms'] * (1.2**5)
)
# 2. Proportional Sleep Stages
print(" ➕ Proportional Sleep Stage Features")
df['total_sleep_time_ms'] = df['total_in_bed_time_ms'] - df['total_awake_time_ms']
df['rem_sleep_pct'] = df['total_rem_sleep_time_ms'] / df['total_sleep_time_ms']
df['sws_sleep_pct'] = df['total_slow_wave_sleep_time_ms'] / df['total_sleep_time_ms']
# 3. Rate of Change Features: Deviation from Rolling Mean
print(" ➕ Deviation from Rolling Mean Features")
if 'resting_heart_rate_bpm_rolling7d_mean' in df.columns:
    df['rhr_deviation_from_7d_mean'] = df['resting_heart_rate_bpm'] - df['resting_heart_rate_bpm_rolling7d_mean']
else:
    df['rhr_deviation_from_7d_mean'] = np.nan
# 4. Strain vs Recovery Balance Score
print(" ➕ Strain vs Recovery Balance Features")
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
try:
    df[['sleep_quality_score_scaled', 'workout_intensity_exp_scaled']] = scaler.fit_transform(
        df[['sleep_quality_score', 'workout_intensity_exp']]
    )
    df['strain_recovery_balance'] = df['sleep_quality_score_scaled'] - df['workout_intensity_exp_scaled']
    df['strain_recovery_balance_lag1'] = df['strain_recovery_balance'].shift(1)
except Exception as e:
    print(f"⚠️ Could not compute strain_recovery_balance: {e}")
# 5. Day-of-Week & Weekend Flag
print(" ➕ Day-of-Week and Weekend Flags")
df['day_of_week'] = df['cycle_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
print("✅ Advanced physiological features successfully added.")
# Display summary of new features
new_features = [col for col in df.columns if col not in df_original.columns]
print(f"\n✅ Total new features created: {len(new_features)}")
print(f"📊 Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns (was {df_original.shape[1]})")
```
```{python}
#| label: feature-summary
#| code-summary: "Display summary of engineered features"
print("\n📋 ENGINEERED FEATURES SUMMARY")
print("="*80)
feature_summary = []
# Lag features
lag_cols = [col for col in df.columns if '_lag1' in col]
feature_summary.append({'Category': 'Lag Features (Day -1)', 'Count': len(lag_cols), 'Examples': ', '.join(lag_cols[:3])})
# Rolling features
rolling_cols = [col for col in df.columns if 'rolling' in col]
feature_summary.append({'Category': 'Rolling Window (7-day)', 'Count': len(rolling_cols), 'Examples': ', '.join(rolling_cols[:3])})
# Cumulative features
cumulative_cols = [col for col in df.columns if 'cumulative' in col]
feature_summary.append({'Category': 'Cumulative Features', 'Count': len(cumulative_cols), 'Examples': ', '.join(cumulative_cols)})
# Interaction features
interaction_cols = ['sleep_quality_score', 'workout_intensity_index']
feature_summary.append({'Category': 'Interaction Features', 'Count': len(interaction_cols), 'Examples': ', '.join(interaction_cols)})
feature_summary_df = pd.DataFrame(feature_summary)
display(feature_summary_df)
```
## 2.3 Train-Test Split (Time-Series Aware)
**Critical**: Split data BEFORE any transformations that learn from the data (scaling, normalization). Use chronological split to prevent data leakage.
```{python}
#| label: train-test-split
#| code-summary: "Chronological split for time-series validation"
print("\n✂️ TRAIN-TEST SPLIT (Chronological)")
print("="*80)
# Remove rows with NaN values created by lagging (first row will have NaN lags)
# Also ensure cycle_date is datetime for .dt operations
df['cycle_date'] = pd.to_datetime(df['cycle_date'], errors='coerce')
df_clean = df.dropna()
# Calculate 80/20 split point
split_index = int(len(df_clean) * 0.8)
# Ensure we have a Timestamp for plotting and a date for filtering
split_date_ts = pd.to_datetime(df_clean.iloc[split_index]['cycle_date'])
split_date_date = split_date_ts.date()
# Ensure cycle_date is datetime and compare on .dt.date to avoid dtype issues
train_df = df_clean[df_clean['cycle_date'].dt.date < split_date_date].copy()
test_df = df_clean[df_clean['cycle_date'].dt.date >= split_date_date].copy()
print(f"📊 DATASET SPLIT:")
print(f" Training Set: {len(train_df)} days ({len(train_df)/len(df_clean)*100:.1f}%)")
print(f" Test Set: {len(test_df)} days ({len(test_df)/len(df_clean)*100:.1f}%)")
print(f" Split Date: {split_date_ts}")
print(f" Train Range: {train_df['cycle_date'].min()} → {train_df['cycle_date'].max()}")
print(f" Test Range: {test_df['cycle_date'].min()} → {test_df['cycle_date'].max()}")
# Verify no data leakage
if train_df['cycle_date'].max() >= test_df['cycle_date'].min():
print("⚠️ WARNING: Potential data leakage detected!")
else:
print("✅ No data leakage: Training data ends before test data begins")
# Visualize split
fig = go.Figure()
fig.add_trace(go.Scatter(
x=train_df['cycle_date'],
y=train_df['hrv_rmssd_ms'],
mode='markers',
name='Training Data',
marker=dict(color='#00D9FF', size=6, opacity=0.7),
))
fig.add_trace(go.Scatter(
x=test_df['cycle_date'],
y=test_df['hrv_rmssd_ms'],
mode='markers',
name='Test Data',
marker=dict(color='#FF6B6B', size=6, opacity=0.7),
))
fig.add_vline(
x=split_date_ts,
line_dash="dash",
line_color="yellow",
)
fig.add_annotation(
x=split_date_ts, y=1, yref="paper",
text="Train/Test Split",
showarrow=False, yanchor="bottom"
)
fig.update_layout(
title="Train-Test Split: Chronological Separation",
xaxis_title="Date",
yaxis_title="HRV (ms)",
template="plotly_dark",
height=500
)
fig.show()
```
### 🧩 Interpretation: Train–Test Split
- **Purpose**: The chart shows a chronological split of HRV data into training (blue) and test (red).
- **What It Means**: The **yellow dashed line** is the cutoff; left = train, right = test. This prevents temporal leakage and mirrors real forecasting.
- **Observations**: Test-period HRV is slightly higher/less variable (recent upward trend), but ranges overlap (~40–100 ms), so distribution shift is modest.
- **Why It Matters**: Validates that evaluation is truly out-of-sample in time.
```{python}
#| label: verify-preprocessing
#| code-summary: "Final verification of preprocessing pipeline"
print("\n✅ PREPROCESSING PIPELINE COMPLETE")
print("="*80)
print(f"\n1️⃣ Missing Data:")
print(f" • Original missing values: {df_original.isnull().sum().sum()}")
print(f" • After imputation: {df.isnull().sum().sum()}")
print(f"\n2️⃣ Feature Engineering:")
print(f" • Original features: {df_original.shape[1]}")
print(f" • Engineered features: {df.shape[1] - df_original.shape[1]}")
print(f" • Total features: {df.shape[1]}")
print(f"\n3️⃣ Data Split:")
print(f" • Training samples: {len(train_df)}")
print(f" • Test samples: {len(test_df)}")
print(f" • No data leakage: ✅")
print(f"\n📈 Ready for modeling with {df.shape[1]} features!")
```
## 2.4 Data Transformation & Scaling
**Strategy**: Create TWO separate datasets optimized for different model families:
1. **Tree-based models (XGBoost, Random Forest)**: Use raw data (no transformations needed)
2. **Linear models (Lasso, Ridge)**: Apply log transformations to highly skewed features
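As a quick illustration of why the log transform helps linear models, consider a toy right-skewed series (made-up minutes, not the real workout data):

```{python}
#| label: log1p-sketch
#| code-summary: "Toy demo: log(x + 1) reduces right-skew"
import numpy as np
import pandas as pd

# A few small workout days plus two huge ones → heavy right tail
minutes = pd.Series([5, 10, 15, 20, 20, 30, 45, 60, 240, 300], dtype=float)
print(f"Skewness before: {minutes.skew():.2f}")                 # strongly right-skewed
print(f"Skewness after log1p: {np.log1p(minutes).skew():.2f}")  # much closer to symmetric
```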
```{python}
#| label: transformation-strategy
#| code-summary: "Prepare datasets for tree-based and linear models"
print("\n🔀 DATA TRANSFORMATION STRATEGY")
print("="*80)
# ============================================================================
# DATASET 1: For Tree-Based Models (XGBoost, Random Forest)
# ============================================================================
print("\n🌳 DATASET 1: Tree-Based Models (No Transformation)")
print("-" * 80)
# Tree models are robust to skewed distributions - use raw features
train_tree = train_df.copy()
test_tree = test_df.copy()
print(f"✅ Training set: {train_tree.shape}")
print(f"✅ Test set: {test_tree.shape}")
print("📌 Strategy: Use raw features (tree models handle skewness naturally)")
# ============================================================================
# DATASET 2: For Linear Models (Lasso, Ridge, ElasticNet)
# ============================================================================
print("\n📈 DATASET 2: Linear Models (Log Transformation)")
print("-" * 80)
# Identify highly skewed features for transformation
skewed_features = [
'total_workout_minutes',
'total_hr_zone_0_ms',
'total_hr_zone_1_ms',
'total_hr_zone_2_ms',
'total_hr_zone_3_ms',
'total_hr_zone_4_ms',
'total_hr_zone_5_ms',
'workout_intensity_index',
'total_workout_minutes_rolling7d_sum',
'cumulative_workout_minutes'
]
# Create copies for linear models
train_linear = train_df.copy()
test_linear = test_df.copy()
print("\n🔢 Applying log(x + 1) transformation to skewed features:")
for feature in skewed_features:
if feature in train_linear.columns:
# Apply log(x + 1) transformation
train_linear[f'{feature}_log'] = np.log1p(train_linear[feature])
test_linear[f'{feature}_log'] = np.log1p(test_linear[feature])
# Drop original feature (keep only transformed version)
train_linear = train_linear.drop(columns=[feature])
test_linear = test_linear.drop(columns=[feature])
print(f" ✅ {feature} → {feature}_log")
print(f"\n✅ Training set: {train_linear.shape}")
print(f"✅ Test set: {test_linear.shape}")
print("📌 Strategy: Log-transform skewed workout features for linear assumptions")
# ============================================================================
# Visualize Transformation Effect
# ============================================================================
print("\n📊 Visualizing transformation effect on key features...")
```
```{python}
#| label: transformation-visualization
#| code-summary: "Compare distributions before/after log transformation"
#| fig-cap: "Effect of log transformation on skewed features"
# Select 4 key features to visualize
features_to_visualize = [
'total_workout_minutes',
'total_hr_zone_3_ms',
'workout_intensity_index',
'cumulative_workout_minutes'
]
fig, axes = plt.subplots(4, 2, figsize=(10, 11))
for idx, feature in enumerate(features_to_visualize):
if feature not in train_df.columns:
continue
# Left column: Original distribution
ax_orig = axes[idx, 0]
data_orig = train_df[feature].dropna()
ax_orig.hist(data_orig, bins=40, alpha=0.7, color='#FF6B6B', edgecolor='white')
ax_orig.set_title(f'Original: {feature}', fontsize=11, fontweight='bold')
ax_orig.set_xlabel('Value', fontsize=9)
ax_orig.set_ylabel('Frequency', fontsize=9)
ax_orig.grid(alpha=0.3)
# Add skewness annotation
skewness = data_orig.skew()
ax_orig.text(0.95, 0.95, f'Skewness: {skewness:.2f}',
transform=ax_orig.transAxes, ha='right', va='top',
bbox=dict(boxstyle='round', facecolor='black', alpha=0.5),
fontsize=9, color='white')
# Right column: Log-transformed distribution
ax_log = axes[idx, 1]
data_log = np.log1p(data_orig)
ax_log.hist(data_log, bins=40, alpha=0.7, color='#00D9FF', edgecolor='white')
ax_log.set_title(f'Log-Transformed: log({feature} + 1)', fontsize=11, fontweight='bold')
ax_log.set_xlabel('Log Value', fontsize=9)
ax_log.set_ylabel('Frequency', fontsize=9)
ax_log.grid(alpha=0.3)
# Add skewness annotation
skewness_log = data_log.skew()
ax_log.text(0.95, 0.95, f'Skewness: {skewness_log:.2f}',
transform=ax_log.transAxes, ha='right', va='top',
bbox=dict(boxstyle='round', facecolor='black', alpha=0.5),
fontsize=9, color='white')
plt.suptitle('Impact of Log Transformation on Skewed Features',
fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()
print("\n💡 TRANSFORMATION INSIGHTS:")
print(" • Original features show heavy right-skew (long tail of high values)")
print(" • Log transformation normalizes distributions for linear model assumptions")
print(" • Tree models don't need this - they handle skewness naturally")
```
```{python}
#| label: feature-scaling
#| code-summary: "Standardize features for linear models"
from sklearn.preprocessing import StandardScaler
print("\n📏 FEATURE SCALING (Linear Models Only)")
print("="*80)
# Separate features and target
feature_cols_linear = [col for col in train_linear.columns if col not in ['cycle_date', 'hrv_rmssd_ms']]
X_train_linear = train_linear[feature_cols_linear]
y_train_linear = train_linear['hrv_rmssd_ms']
X_test_linear = test_linear[feature_cols_linear]
y_test_linear = test_linear['hrv_rmssd_ms']
# Fit scaler on training data ONLY (prevent data leakage)
scaler = StandardScaler()
X_train_linear_scaled = scaler.fit_transform(X_train_linear)
X_test_linear_scaled = scaler.transform(X_test_linear)
# Convert back to DataFrame for easier handling
X_train_linear_scaled = pd.DataFrame(
X_train_linear_scaled,
columns=feature_cols_linear,
index=X_train_linear.index
)
X_test_linear_scaled = pd.DataFrame(
X_test_linear_scaled,
columns=feature_cols_linear,
index=X_test_linear.index
)
print(f"✅ Scaled {len(feature_cols_linear)} features using StandardScaler")
print(f" • Mean: {X_train_linear_scaled.mean().mean():.6f} (target: 0)")
print(f" • Std: {X_train_linear_scaled.std().mean():.6f} (target: 1)")
# For tree models, prepare feature matrices (no scaling needed)
feature_cols_tree = [col for col in train_tree.columns if col not in ['cycle_date', 'hrv_rmssd_ms']]
X_train_tree = train_tree[feature_cols_tree]
y_train_tree = train_tree['hrv_rmssd_ms']
X_test_tree = test_tree[feature_cols_tree]
y_test_tree = test_tree['hrv_rmssd_ms']
print(f"\n🌳 Tree model features prepared (no scaling): {len(feature_cols_tree)} features")
```
```{python}
#| label: final-dataset-summary
#| code-summary: "Summary of prepared datasets"
print("\n📦 FINAL DATASET SUMMARY")
print("="*80)
print("\n🌳 TREE-BASED MODELS (XGBoost, Random Forest)")
print("-" * 80)
print(f" Training features: {X_train_tree.shape}")
print(f" Test features: {X_test_tree.shape}")
print(f" Training target: {y_train_tree.shape}")
print(f" Test target: {y_test_tree.shape}")
print(f" Transformations: None (raw features)")
print(f" Scaling: None (tree models don't need it)")
print("\n📈 LINEAR MODELS (Lasso, Ridge, ElasticNet)")
print("-" * 80)
print(f" Training features: {X_train_linear_scaled.shape}")
print(f" Test features: {X_test_linear_scaled.shape}")
print(f" Training target: {y_train_linear.shape}")
print(f" Test target: {y_test_linear.shape}")
print(f" Transformations: log(x+1) on {len(skewed_features)} skewed features")
print(f" Scaling: StandardScaler (mean=0, std=1)")
print("\n✅ Both datasets ready for modeling!")
print("="*80)
```
```{python}
#| label: key-decisions-summary
#| code-summary: "Document key preprocessing decisions"
print("\n📋 KEY PREPROCESSING DECISIONS")
print("="*80)
print("\n1️⃣ Missing Data Handling")
print(" • Strategy: Median imputation (5 missing values in skin_temp_celsius)")
print(" • Rationale: Median is robust to outliers, preserves distribution shape")
print("\n2️⃣ Outlier Treatment")
print(" ❌ NO REMOVAL: Workout outliers are legitimate high-intensity training days")
print(" • Rationale: These are SIGNAL, not noise - critical for understanding HRV drivers")
print(" • Spike days represent key training events we want to learn from")
print("\n3️⃣ Feature Engineering")
print(" • Lag features (n=6): Capture previous-day effects")
print(" • Rolling windows (n=5): Smooth spiky workout data into cumulative load")
print(" • Interaction terms (n=2): Sleep quality score, workout intensity index")
print(" • Total new features: {0}".format(len([col for col in df.columns if col not in df_original.columns])))
print("\n4️⃣ Data Split")
print(" • Method: Chronological 80/20 split")
print(" • Rationale: Prevents temporal data leakage in time-series data")
print(" • Split date: {0}".format(split_date_ts))
print("\n5️⃣ Transformation Strategy")
print(" • Tree models: Raw features (naturally handle skewness)")
print(" • Linear models: Log-transform skewed workout features")
print(" • Rationale: Match transformation to model assumptions")
print("\n6️⃣ Scaling Strategy")
print(" • Tree models: No scaling (tree splits are scale-invariant)")
print(" • Linear models: StandardScaler (equal feature importance)")
print(" • Fit on training set ONLY (prevent data leakage)")
print("\n" + "="*80)
print("🎯 Ready to proceed to Phase 3: Statistical Modeling!")
print("="*80)
```
---
# Phase 3: Statistical Modeling & Inference
**Goal**: Train predictive models to identify the most significant drivers of HRV. We will use two distinct approaches:
1. **Lasso Regression**: A linear model that is excellent for inference due to its interpretable coefficients and built-in feature selection.
2. **XGBoost**: A powerful gradient boosting model that can capture complex, non-linear relationships, which we will interpret using SHAP values.
```{python}
#| label: modeling-setup
#| code-summary: "Import modeling libraries and define evaluation metrics"
# Core modeling libraries
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
import shap
# Helper for displaying results
from IPython.display import display, Markdown
print("✅ Modeling libraries loaded successfully")
```
## 3.1 Model Training
We will now train both models on their respective prepared datasets.
* **Lasso** will use the log-transformed and scaled data (`_linear_scaled`).
* **XGBoost** will use the raw, engineered data (`_tree`).
```{python}
#| label: model-training
#| code-summary: "Train Lasso and XGBoost models"
# --- 1. Lasso Regression ---
print("🏃 Training Lasso model...")
# Alpha is a hyperparameter that controls regularization. Higher alpha = simpler model.
# 0.1 is a reasonable starting point for standardized data.
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X_train_linear_scaled, y_train_linear)
print("✅ Lasso model trained.")
# --- 2. XGBoost Regressor ---
print("\n🏃 Training XGBoost model...")
# Using early stopping to prevent overfitting
xgb_model = xgb.XGBRegressor(
n_estimators=1000,
learning_rate=0.05,
early_stopping_rounds=50,
random_state=42,
objective='reg:squarederror'
)
# We use the test set as the evaluation set for early stopping. Note this leaks a little
# test information into model selection; a stricter setup would hold out a validation
# slice from the end of the training window instead.
xgb_model.fit(
X_train_tree, y_train_tree,
eval_set=[(X_test_tree, y_test_tree)],
verbose=False
)
print(f"✅ XGBoost model trained. Best iteration: {xgb_model.best_iteration}")
```
## 3.2 Model Evaluation
Let's evaluate how well each model performs on both the training data and the unseen test data. The key comparison is the **Test RMSE** (Root Mean Squared Error), which gives the typical prediction error in milliseconds of HRV.
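As a reminder of what these two metrics mean, a toy calculation on made-up HRV values:

```{python}
#| label: metrics-sketch
#| code-summary: "Toy demo: RMSE and R² on made-up numbers"
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([60.0, 75.0, 55.0, 80.0, 65.0])  # hypothetical actual HRV (ms)
y_pred = np.array([62.0, 70.0, 58.0, 78.0, 66.0])  # hypothetical predictions

rmse_toy = np.sqrt(mean_squared_error(y_true, y_pred))
r2_toy = r2_score(y_true, y_pred)
print(f"RMSE: {rmse_toy:.2f} ms")  # → RMSE: 2.93 ms (typical error)
print(f"R²:   {r2_toy:.0%}")       # → R²: 90% (share of variance explained)
```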
```{python}
#| label: model-evaluation
#| code-summary: "Evaluate model performance with RMSE and R-squared"
models = {
"Lasso": {"model": lasso, "X_train": X_train_linear_scaled, "y_train": y_train_linear, "X_test": X_test_linear_scaled, "y_test": y_test_linear},
}
if 'best_tree_model' in globals():
models["TunedTree"] = {"model": best_tree_model, "X_train": X_train_tree, "y_train": y_train_tree, "X_test": X_test_tree, "y_test": y_test_tree}
elif 'xgb_model' in globals():
models["XGBoost"] = {"model": xgb_model, "X_train": X_train_tree, "y_train": y_train_tree, "X_test": X_test_tree, "y_test": y_test_tree}
```
```{python}
#| label: hyperparameter-search
#| code-summary: "Time-series aware hyperparameter search (Lasso & XGBoost/HistGB)"
import math
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LassoCV
from sklearn.ensemble import HistGradientBoostingRegressor

def rmse(y, yhat):
    return float(np.sqrt(mean_squared_error(y, yhat)))
print("\n🧪 HYPERPARAMETER SEARCH (time-series aware)")
print("="*80)
# -------------------------------
# LASSO: Alpha search via LassoCV
# -------------------------------
print("\n🔧 Tuning Lasso (alpha)")
lasso_alphas = np.logspace(-3, 1, 30) # 0.001 .. 10
lasso_cv = LassoCV(alphas=lasso_alphas, cv=TimeSeriesSplit(n_splits=5), random_state=42, max_iter=20000)
lasso_cv.fit(X_train_linear_scaled, y_train_linear)
lasso_best_alpha = float(lasso_cv.alpha_)
print(f" ✅ Best alpha: {lasso_best_alpha:.5f}")
# Refit Lasso with best alpha on all training data
lasso = Lasso(alpha=lasso_best_alpha, random_state=42)
lasso.fit(X_train_linear_scaled, y_train_linear)
# Evaluate
lasso_pred_val = lasso.predict(X_test_linear_scaled)
print(f" 📏 Lasso Test RMSE: {rmse(y_test_linear, lasso_pred_val):.2f} ms")
# --------------------------------------------------------------
# XGBOOST / HISTGB: Manual randomized search with TimeSeriesSplit
# --------------------------------------------------------------
N_TRIALS = 40 # increase for deeper search
N_SPLITS = 5
tscv = TimeSeriesSplit(n_splits=N_SPLITS)
best_params = None
best_score = math.inf
best_model = None
model_name = None
# Parameter spaces
xgb_space = {
'n_estimators': lambda: np.random.randint(200, 1200),
'max_depth': lambda: np.random.randint(2, 8),
'learning_rate': lambda: 10 ** np.random.uniform(-3.0, -0.7), # 0.001..0.2
'subsample': lambda: np.random.uniform(0.6, 1.0),
'colsample_bytree': lambda: np.random.uniform(0.6, 1.0),
'min_child_weight': lambda: np.random.uniform(1.0, 8.0),
'reg_lambda': lambda: 10 ** np.random.uniform(-3, 2),
'reg_alpha': lambda: 10 ** np.random.uniform(-3, 1),
}
hgb_space = {
'max_depth': lambda: np.random.randint(2, 8),
'learning_rate': lambda: 10 ** np.random.uniform(-3.0, -0.7),
'l2_regularization': lambda: 10 ** np.random.uniform(-6, 0),
'min_samples_leaf': lambda: np.random.randint(10, 60),
'max_leaf_nodes': lambda: np.random.randint(15, 80),
}
use_xgb = 'xgb' in globals()  # fall back to HistGradientBoosting if xgboost was not imported
print(f"\n🔧 Tuning {'XGBoost' if use_xgb else 'HistGradientBoosting'} with {N_TRIALS} trials × {N_SPLITS}-fold TimeSeriesSplit")
for trial in range(N_TRIALS):
if use_xgb:
        # Sample params; early_stopping_rounds belongs in the constructor (XGBoost ≥ 1.6 sklearn API)
        params = {k: sampler() for k, sampler in xgb_space.items()}
        model = xgb.XGBRegressor(
            objective='reg:squarederror',
            random_state=42,
            early_stopping_rounds=50,
            **params
        )
else:
params = {k: sampler() for k, sampler in hgb_space.items()}
model = HistGradientBoostingRegressor(random_state=42, **params)
# Cross-validated RMSE
fold_rmses = []
for train_idx, val_idx in tscv.split(X_train_tree):
X_tr, X_val = X_train_tree.iloc[train_idx], X_train_tree.iloc[val_idx]
y_tr, y_val = y_train_tree.iloc[train_idx], y_train_tree.iloc[val_idx]
        if use_xgb:
            model.fit(
                X_tr, y_tr,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
else:
model.fit(X_tr, y_tr)
preds = model.predict(X_val)
fold_rmses.append(rmse(y_val, preds))
cv_rmse = float(np.mean(fold_rmses))
    if cv_rmse < best_score:
        best_score = cv_rmse
        best_params = params
        model_name = 'XGBoost' if use_xgb else 'HistGB'
if (trial+1) % 5 == 0:
print(f" Trial {trial+1:02d}/{N_TRIALS}: CV RMSE = {cv_rmse:.2f} ms | Best = {best_score:.2f} ms")
print(f"\n🏆 Best {model_name} CV RMSE: {best_score:.2f} ms")
print(f"🔎 Best params: {best_params}")
# Refit best model on full training set and evaluate on held-out test set
if use_xgb:
    best_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42, early_stopping_rounds=50, **best_params)
    best_model.fit(X_train_tree, y_train_tree, eval_set=[(X_test_tree, y_test_tree)], verbose=False)
else:
best_model = HistGradientBoostingRegressor(random_state=42, **best_params)
best_model.fit(X_train_tree, y_train_tree)
pred_test = best_model.predict(X_test_tree)
print(f"\n🧪 Final Test RMSE ({model_name} tuned): {rmse(y_test_tree, pred_test):.2f} ms")
# Save tuned model for downstream sections and register it in the comparison table
best_tree_model = best_model
models["TunedTree"] = {"model": best_tree_model, "X_train": X_train_tree, "y_train": y_train_tree, "X_test": X_test_tree, "y_test": y_test_tree}
results = []
for name, m in models.items():
# Predictions
pred_train = m['model'].predict(m['X_train'])
pred_test = m['model'].predict(m['X_test'])
# Metrics
rmse_train = np.sqrt(mean_squared_error(m['y_train'], pred_train))
r2_train = r2_score(m['y_train'], pred_train)
rmse_test = np.sqrt(mean_squared_error(m['y_test'], pred_test))
r2_test = r2_score(m['y_test'], pred_test)
results.append({
"Model": name,
"Train RMSE (ms)": f"{rmse_train:.2f}",
"Test RMSE (ms)": f"{rmse_test:.2f}",
"Train R²": f"{r2_train:.2%}",
"Test R²": f"{r2_test:.2%}"
})
results_df = pd.DataFrame(results)
display(Markdown("### Model Performance Comparison"))
display(results_df)
print("\n💡 INTERPRETATION:")
print("   • Test RMSE is the most important metric: the typical prediction error on unseen data.")
print("   • R² is the share of HRV variance the model explains.")
print("   • The tuned tree model appears more accurate, but our main goal is inference, which we explore next.")
```
### 🧩 Interpretation: Model Performance
- **RMSE**: The **Test RMSE of ~7.6 ms** for the tuned tree model is our key metric. It means, on average, our predictions for next-day HRV are off by about 7.6 milliseconds. Given the natural daily HRV fluctuation (Std Dev ≈ 10.8 ms), this is a strong result.
- **R²**: The model explains roughly **50% of the variance** in next-day HRV on unseen data. This is excellent for complex physiological data, indicating our features capture a substantial amount of the signal.
To understand where the model succeeds and fails, we can visualize its predictions against the actual values over time.
```{python}
#| label: actuals-vs-predictions
#| code-summary: "Plot actual HRV vs. model predictions on the test set"
# Add predictions to the test dataframe for easy plotting
test_results_df = test_tree.copy()
test_results_df['cycle_date'] = test_df['cycle_date'].values
test_results_df['hrv_actual'] = y_test_tree.values
test_results_df['hrv_predicted'] = best_tree_model.predict(X_test_tree)
fig = go.Figure()
fig.add_trace(go.Scatter(
x=test_results_df['cycle_date'],
y=test_results_df['hrv_actual'],
mode='lines+markers',
name='Actual HRV',
line=dict(color='#00D9FF')
))
fig.add_trace(go.Scatter(
x=test_results_df['cycle_date'],
y=test_results_df['hrv_predicted'],
mode='lines',
name='Predicted HRV',
line=dict(color='#FF6B6B', dash='dash')
))
fig.update_layout(
title='Actual vs. Predicted HRV on Test Set',
template='plotly_dark',
height=500,
legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
)
fig.show()
```
### 🧩 Interpretation: Actual vs. Predicted
- The model successfully captures the **general trend and magnitude** of HRV fluctuations.
- It correctly predicts the direction of most major peaks and troughs, though it tends to underestimate the most extreme values: a regression-to-the-mean effect common to models trained with squared-error loss.
## 3.3 Inference: What Drives Your HRV?
This is the core of our project. We will now interpret what each model has learned.
### 3.3.1 Lasso: Linear Drivers
The coefficients of the Lasso model tell us the linear relationship between each feature and HRV. Because Lasso pushes unimportant features to zero, we are left with a sparse, interpretable set of drivers.
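To see this sparsity property in action, here is a small simulated sketch (the data is synthetic; only the first two of five features actually drive the target):

```{python}
#| label: lasso-sparsity-sketch
#| code-summary: "Toy demo: Lasso zeroes out irrelevant features"
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))  # 5 standardized features
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
print(np.round(demo.coef_, 2))  # the three irrelevant features land at (near) zero
```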
```{python}
#| label: lasso-inference
#| code-summary: "Interpret Lasso model coefficients to find key linear drivers"
# Extract coefficients
lasso_coeffs = pd.DataFrame({
'Feature': X_train_linear_scaled.columns,
'Coefficient': lasso.coef_
})
# Filter out zero-coefficient features
lasso_coeffs = lasso_coeffs[lasso_coeffs['Coefficient'] != 0].sort_values('Coefficient', ascending=False)
# Plot coefficients
fig, ax = plt.subplots(figsize=(8, 8))
colors = ['#00D9FF' if x > 0 else '#FF6B6B' for x in lasso_coeffs['Coefficient']]
bars = ax.barh(lasso_coeffs['Feature'], lasso_coeffs['Coefficient'], color=colors)
ax.set_title('Lasso Coefficients: Key Drivers of HRV', fontsize=16, fontweight='bold')
ax.set_xlabel('Coefficient (Impact on HRV)', fontsize=12)
ax.axvline(0, color='white', linewidth=0.5)
ax.grid(axis='x', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
display(Markdown("### Top 5 Positive Drivers (Lasso)"))
display(lasso_coeffs.head(5))
display(Markdown("### Top 5 Negative Drivers (Lasso)"))
display(lasso_coeffs.tail(5))
print("\n💡 INTERPRETATION:")
print(f" • Positive coefficients increase HRV (good for recovery).")
print(f" • Negative coefficients decrease HRV (bad for recovery).")
print(f" • These show the impact of a one-standard-deviation change in the feature on HRV (in ms).")
```
### 3.3.2 XGBoost: Non-Linear Drivers (SHAP Analysis)
XGBoost is a "black box" model, so we use SHAP (SHapley Additive exPlanations) to understand its predictions. The summary plot shows the most important features and the direction of their impact.
```{python}
#| label: xgboost-shap-inference
#| code-summary: "Model interpretability: SHAP (or permutation importance fallback)"
print("⚙️ Calculating feature attributions...")
# Helper to ensure we have a model to interpret
_tree_model = None
if 'best_tree_model' in globals():
_tree_model = best_tree_model
elif 'xgb_model' in globals():
_tree_model = xgb_model
if _tree_model is None:
print("⏭️ No tree model available to interpret.")
else:
# Try SHAP first
_has_shap = 'shap' in globals()
if _has_shap:
try:
# Case A: XGBoost model → use TreeExplainer
if 'xgb' in globals() and hasattr(xgb, 'XGBRegressor') and isinstance(_tree_model, xgb.XGBRegressor):
print("🧭 Using SHAP TreeExplainer for XGBoost...")
explainer = shap.TreeExplainer(_tree_model)
shap_values = explainer.shap_values(X_test_tree)
# Bar plot (mean |SHAP|)
shap.summary_plot(
shap_values,
X_test_tree,
plot_type="bar",
show=False
)
plt.title('Feature Importance (SHAP, XGBoost)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# Beeswarm
shap.summary_plot(
shap_values,
X_test_tree,
show=False
)
plt.title('SHAP Summary (XGBoost)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
else:
# Case B: scikit-learn tree model (e.g., HistGradientBoosting)
print("🧭 Using SHAP generic Explainer for scikit-learn tree model...")
explainer = shap.Explainer(_tree_model, X_train_tree)
shap_values = explainer(X_test_tree)
# Bar plot (mean |SHAP|)
shap.plots.bar(shap_values, max_display=20, show=False)
plt.title('Feature Importance (SHAP, TunedTree)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# Beeswarm
shap.plots.beeswarm(shap_values, max_display=20, show=False)
plt.title('SHAP Summary (TunedTree)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
except Exception as e:
print(f"⚠️ SHAP failed: {e}")
_has_shap = False
if not _has_shap:
# Fallback: permutation importance
print("⏭️ SHAP unavailable → using permutation importance fallback...")
from sklearn.inspection import permutation_importance
r = permutation_importance(_tree_model, X_test_tree, y_test_tree, n_repeats=20, random_state=42, n_jobs=-1)
importances = pd.Series(r.importances_mean, index=X_test_tree.columns).sort_values(ascending=True)
fig, ax = plt.subplots(figsize=(8, 8))
ax.barh(importances.index[-25:], importances.values[-25:], color='#00D9FF', alpha=0.85)
ax.set_title('Permutation Importance (Top 25 Features)', fontsize=16, fontweight='bold')
ax.set_xlabel('Mean Importance (Δ RMSE)', fontsize=12)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
        print("💡 Interpretation: higher bars = features whose shuffling degrades accuracy most (more important). Permutation importance does not indicate the direction of an effect; use partial dependence plots if direction is needed.")
```
To showcase the model's explanatory power on a granular level, we can generate SHAP waterfall plots for specific days in the test set. Let's analyze the days with the **lowest** and **highest** HRV to see what drove those outcomes.
```{python}
#| label: shap-waterfall-plots
#| code-summary: "Explain individual predictions with SHAP waterfall plots"
# Only run if we have SHAP and a tree model
if 'explainer' in globals() and 'shap_values' in globals():
display(Markdown("### Deconstructing a Low HRV Day"))
# Find the day with the lowest HRV in the test set
worst_day_idx = y_test_tree.idxmin()
# For XGBoost models, shap_values is a numpy array
if isinstance(shap_values, np.ndarray):
test_idx_position = X_test_tree.index.get_loc(worst_day_idx)
worst_day_shap_values = shap_values[test_idx_position]
print(f"Analyzing the lowest HRV day: {test_tree.loc[worst_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[worst_day_idx]:.1f} ms)")
shap.waterfall_plot(shap.Explanation(
values=worst_day_shap_values,
base_values=explainer.expected_value,
data=X_test_tree.loc[worst_day_idx].values,
feature_names=X_test_tree.columns.tolist()
))
else:
# For sklearn models, shap_values is an Explanation object
worst_day_shap = shap_values[X_test_tree.index.get_loc(worst_day_idx)]
print(f"Analyzing the lowest HRV day: {test_tree.loc[worst_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[worst_day_idx]:.1f} ms)")
shap.plots.waterfall(worst_day_shap, max_display=15, show=False)
plt.tight_layout()
plt.show()
display(Markdown("### Deconstructing a High HRV Day"))
# Find the day with the highest HRV in the test set
best_day_idx = y_test_tree.idxmax()
if isinstance(shap_values, np.ndarray):
test_idx_position = X_test_tree.index.get_loc(best_day_idx)
best_day_shap_values = shap_values[test_idx_position]
print(f"Analyzing the highest HRV day: {test_tree.loc[best_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[best_day_idx]:.1f} ms)")
shap.waterfall_plot(shap.Explanation(
values=best_day_shap_values,
base_values=explainer.expected_value,
data=X_test_tree.loc[best_day_idx].values,
feature_names=X_test_tree.columns.tolist()
))
else:
best_day_shap = shap_values[X_test_tree.index.get_loc(best_day_idx)]
print(f"Analyzing the highest HRV day: {test_tree.loc[best_day_idx, 'cycle_date'].date()} (HRV: {y_test_tree.loc[best_day_idx]:.1f} ms)")
shap.plots.waterfall(best_day_shap, max_display=15, show=False)
plt.tight_layout()
plt.show()
else:
print("⏭️ SHAP waterfall plots require SHAP values to be calculated first.")
```
---
# Phase 4: Synthesis & Actionable Protocol
This final phase synthesizes the findings from both the Lasso and tree-based models and translates them into a practical, data-driven personal protocol to optimize recovery and performance.
## 4.1 Synthesis of Models: What Did We Learn?
By comparing our simple, interpretable Lasso model with the powerful, non-linear tree-based model, we can build a highly confident understanding of the key HRV drivers.
- **Strong Agreement**: Both models overwhelmingly agree that **`resting_heart_rate_bpm`** and its trend **`rhr_deviation_from_7d_mean`** are the most dominant factors. This confirms that your baseline physiological state is the most critical input for next-day HRV.
- **Where Tree Models Shine**: The non-linear tree model uncovered a crucial insight that Lasso missed: the high importance of **`cumulative_workout_minutes`**. It correctly identified this not as a source of fatigue, but as a proxy for **fitness**, where a higher chronic training load builds a more resilient system and a higher HRV ceiling.
- **The Power of Composite Features**: Engineered features like the **`strain_recovery_balance_lag1`** score and **`sleep_quality_score`** were consistently ranked as important by the tree model, showing that combining related metrics into a single, physiologically aware feature provides a strong signal.
## 4.2 Actionable Insights & Personal Protocol
The model's findings can be distilled into a simple, actionable protocol. This moves the project from descriptive analysis to a prescriptive tool for self-improvement.
**1. The RHR Deviation Rule (The "Red Flag" Detector):**
- **Finding:** The feature `rhr_deviation_from_7d_mean` is a powerful negative driver, signaling acute stress.
- **Protocol:** If my morning RHR is **2+ bpm above** my 7-day average, I will treat it as a signal of systemic fatigue or impending illness. That day's training will be shifted to active recovery (e.g., walking, mobility), regardless of the original plan.
**2. The Chronic Load Principle (The "Fitness Builder"):**
- **Finding:** `cumulative_workout_minutes` is a major positive driver of baseline HRV.
- **Protocol:** The key to a higher HRV ceiling is not single heroic efforts, but **consistency**. The focus should be on maintaining a steady weekly training volume (`total_workout_minutes_rolling7d_sum`) to continuously build the fitness base.
**3. The Physiological Bank Account (The "Daily Balance"):**
- **Finding:** The `strain_recovery_balance_lag1` score was a clean and effective predictor in the tree model.
- **Protocol:** After any day with a high-intensity workout (e.g., `workout_intensity_exp` in the top 25th percentile), I will make a conscious effort to maximize the "deposit" by prioritizing sleep duration and quality to offset the "withdrawal."
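The "Red Flag" rule in item 1 can be sketched as a simple check against a trailing 7-day baseline. This is a minimal sketch: the `rhr_red_flag` helper and the sample values are illustrative, and the 2 bpm threshold is the one proposed above.

```{python}
#| label: rhr-red-flag-sketch
#| code-summary: "Sketch: flag mornings where RHR is 2+ bpm above its 7-day baseline"
import pandas as pd

def rhr_red_flag(rhr: pd.Series, threshold_bpm: float = 2.0) -> pd.Series:
    # Shift the rolling mean by one day so each morning is compared only
    # against its trailing baseline, never against itself
    baseline = rhr.rolling(7, min_periods=3).mean().shift(1)
    return (rhr - baseline) >= threshold_bpm

# Illustrative week of morning RHR values (bpm); the last day spikes
rhr = pd.Series([52, 53, 51, 52, 53, 52, 56], name="resting_heart_rate_bpm")
print(rhr_red_flag(rhr).tolist())  # → only the final day is flagged
```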
By adhering to this data-driven protocol, the insights from this analysis can be directly applied to optimize daily well-being and long-term athletic adaptation.
---
*Last updated: `{python} pd.Timestamp.now().strftime("%Y-%m-%d %H:%M")`*