Data-Driven Methods
Advanced Track
This module extends Lab 07: Data Analysis with deeper theory on regression, system identification, and model validation.
Prerequisites
- Completed Lab 07 (basic data collection and model fitting)
- Understanding of linear relationships
- Basic statistics (mean, standard deviation)
1. Regression Theory
Why Regression?
In embedded systems, we often need to predict one quantity from another:
| What We Measure | What We Want | Example |
|---|---|---|
| PWM command | Actual speed | Motor model |
| IMU vibration | Robot velocity | Speed estimation |
| Temperature | Resistance | Thermistor calibration |
| ADC value | Physical unit | Sensor linearization |
Regression finds the mathematical relationship between these quantities.
Linear Regression Derivation
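Given data points \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), we want to find the line \(y = mx + b\) that minimizes the sum of squared errors.
Least Squares Objective:
\[ E(m, b) = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2 \]
Taking partial derivatives and setting to zero:
\[ \frac{\partial E}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i - b \right) = 0, \qquad \frac{\partial E}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right) = 0 \]
Solving these two equations for \(m\) and \(b\) gives:
\[ m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad b = \frac{\sum y_i - m \sum x_i}{n} \]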
Implementation
```python
def linear_regression(x_data, y_data):
    """
    Fit y = mx + b using least squares.

    Returns:
        m: slope
        b: intercept
        r_squared: coefficient of determination
    """
    n = len(x_data)
    if n < 2 or n != len(y_data):
        raise ValueError("need at least two (x, y) pairs of equal length")

    # Sums
    sum_x = sum(x_data)
    sum_y = sum(y_data)
    sum_xy = sum(x * y for x, y in zip(x_data, y_data))
    sum_x2 = sum(x * x for x in x_data)

    # Slope and intercept
    denominator = n * sum_x2 - sum_x * sum_x
    if denominator == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n

    # R-squared
    y_mean = sum_y / n
    ss_tot = sum((y - y_mean) ** 2 for y in y_data)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(x_data, y_data))
    r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else 0

    return m, b, r_squared
```
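The sums-based formula subtracts two large, nearly equal numbers when the x values are big (raw ADC counts, for example), which can lose precision on hardware with 32-bit floats. A minimal sketch of the algebraically equivalent mean-centered form follows; the function name `fit_line_centered` and the PWM/speed samples are made up for illustration and are not measured data.

```python
def fit_line_centered(x_data, y_data):
    """Least-squares line via mean-centered sums: m = S_xy / S_xx."""
    n = len(x_data)
    x_mean = sum(x_data) / n
    y_mean = sum(y_data) / n
    s_xx = sum((x - x_mean) ** 2 for x in x_data)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_data, y_data))
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = s_xy / s_xx
    b = y_mean - m * x_mean
    return m, b


# Hypothetical samples: PWM command vs. measured speed in cm/s
pwm = [60, 80, 100, 120, 140]
speed = [11.0, 20.5, 29.0, 38.5, 47.0]
m, b = fit_line_centered(pwm, speed)
print("slope:", m, "intercept:", b)
```

Both forms minimize the same objective, so on well-scaled data they return the same slope and intercept up to rounding.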
Polynomial Regression
For non-linear relationships, extend to polynomials:
```python
def polynomial_features(x, degree):
    """Generate polynomial features [1, x, x², ..., x^degree]."""
    return [x ** i for i in range(degree + 1)]

# Example: Quadratic fit for motor dead zone
# speed = a + b*pwm + c*pwm²
```
When to use polynomial:
- Motor dead zone (speed vs PWM is not linear near zero)
- Sensor non-linearity
- Temperature compensation curves
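To actually fit the polynomial coefficients, one option is to solve the normal equations \((X^T X) c = X^T y\), where each row of \(X\) is a feature vector from `polynomial_features`. The sketch below does this in plain Python with Gaussian elimination; `solve_linear_system`, `polynomial_fit` and `polynomial_predict` are illustrative names, not part of Lab 07.

```python
def polynomial_features(x, degree):
    """Generate polynomial features [1, x, x², ..., x^degree]."""
    return [x ** i for i in range(degree + 1)]


def solve_linear_system(a, b):
    """Solve a·c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; need more distinct x values")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - s) / aug[r][r]
    return coeffs


def polynomial_fit(x_data, y_data, degree):
    """Return coefficients [c0, c1, ..., c_degree] of the least-squares polynomial."""
    rows = [polynomial_features(x, degree) for x in x_data]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, y_data)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def polynomial_predict(coeffs, x):
    """Evaluate c0 + c1*x + c2*x² + ... at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))
```

With raw PWM values up to a few hundred, the powers grow quickly and the normal equations become ill-conditioned; scaling x to roughly 0 to 1 before fitting is a common precaution.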
2. Model Evaluation Metrics
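R² (Coefficient of Determination)
\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} \]
Here \(\hat{y}_i\) is the model prediction for sample \(i\) and \(\bar{y}\) is the mean of the measured outputs.
Interpretation: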
| R² Value | Meaning | Action |
|---|---|---|
| 0.95-1.0 | Excellent | Deploy with confidence |
| 0.85-0.95 | Good | Acceptable for most applications |
| 0.70-0.85 | Moderate | Consider additional features |
| < 0.70 | Poor | Model doesn't capture relationship |
Caution: A high R² on the training data does not guarantee good predictions on new data! Confirm it on data the model has not seen.
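RMSE (Root Mean Square Error)
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]
Physical interpretation: RMSE has the same units as your output. If you are predicting speed in cm/s and RMSE = 3.5, a typical prediction is off by about 3.5 cm/s. Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.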
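Both metrics can be computed from any list of measured values and the matching predictions, so the same helpers work for training and test data. A minimal sketch, with `rmse` and `r_squared` as illustrative function names:

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as y."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
```

On held-out data R² can drop below zero, which means the model predicts worse than simply guessing the mean.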
Cross-Validation
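Never evaluate on training data alone! The simplest check is a hold-out split: fit on one part of the data and score on the part the model has never seen.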
```python
import random

def train_test_split(x_data, y_data, test_fraction=0.2):
    """
    Split data into training and testing sets.

    Use training set to fit model.
    Use test set to evaluate - this is your real performance!
    """
    n = len(x_data)
    n_test = int(n * test_fraction)

    # Shuffle indices (simple random approach)
    indices = list(range(n))
    random.shuffle(indices)

    test_idx = indices[:n_test]
    train_idx = indices[n_test:]

    x_train = [x_data[i] for i in train_idx]
    y_train = [y_data[i] for i in train_idx]
    x_test = [x_data[i] for i in test_idx]
    y_test = [y_data[i] for i in test_idx]

    return x_train, y_train, x_test, y_test
```
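A single split gives one noisy estimate, especially with small datasets. k-fold cross-validation repeats the split so that every sample is used for testing exactly once. The sketch below is one way to do it; `k_fold_indices` and `cross_validate` are illustrative names, and `fit` and `score` are callables you supply (for example a wrapper around `linear_regression` and an RMSE function).

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(x_data, y_data, fit, score, k=5):
    """Return (mean score, per-fold scores).

    fit(x_train, y_train) -> model
    score(model, x_test, y_test) -> float
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(x_data), k):
        model = fit([x_data[i] for i in train_idx],
                    [y_data[i] for i in train_idx])
        scores.append(score(model,
                            [x_data[i] for i in test_idx],
                            [y_data[i] for i in test_idx]))
    return sum(scores) / len(scores), scores
```

The spread of the per-fold scores is as informative as the mean: a wide spread suggests the dataset is too small or too uneven for the model being fitted.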
Overfitting Warning
Symptoms of overfitting:
- Training R² = 0.98, Test R² = 0.60 ← Big gap!
- Model fits training data perfectly but fails on new data
- Using too many polynomial terms for small dataset
Prevention:
- Use simpler models (start with linear)
- Collect more data
- Always evaluate on held-out test set
3. System Identification
System identification (sysid) is the process of building mathematical models of dynamic systems from measured data.
First-Order System
Many physical systems can be modeled as first-order:
\[ \tau \frac{dy}{dt} + y = K u \]
Where:
- \(y\) = output (speed, temperature, etc.)
- \(u\) = input (PWM, voltage, etc.)
- \(K\) = steady-state gain
- \(\tau\) = time constant
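For a step input of amplitude \(A\) applied at \(t = 0\) to a system at rest, a first-order model with gain \(K\) and time constant \(\tau\) has the closed-form response \(y(t) = K A \left(1 - e^{-t/\tau}\right)\), which reaches about 63.2% of its final value at \(t = \tau\). The sketch below evaluates that formula and also integrates the model numerically with forward Euler steps; the function names are illustrative, and the Euler version assumes the time step `dt` is much smaller than `tau`.

```python
import math


def step_response(K, tau, step_amplitude, t, y0=0.0):
    """Closed-form output at time t after a step; y0 is the output before the step."""
    y_final = y0 + K * step_amplitude
    return y_final + (y0 - y_final) * math.exp(-t / tau)


def simulate_euler(K, tau, u_of_t, dt, n_steps, y0=0.0):
    """Integrate dy/dt = (K*u - y) / tau with forward Euler steps of size dt."""
    y = y0
    outputs = [y]
    for k in range(n_steps):
        dydt = (K * u_of_t(k * dt) - y) / tau
        y += dt * dydt
        outputs.append(y)
    return outputs


# Example values only: K = 0.45, tau = 0.15 s, step of 100 PWM units
exact = step_response(0.45, 0.15, 100, t=0.15)
approx = simulate_euler(0.45, 0.15, lambda t: 100, dt=0.001, n_steps=150)[-1]
print("closed form:", exact, "Euler:", approx)
```

The Euler form is the one to use when the input is not a clean step, since `u_of_t` can be any function of time.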
Step Response Method
Apply a step input and measure the response:
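```python
def identify_first_order(times, outputs, step_amplitude):
    """
    Identify K and tau from step response data.

    Assumes:
    - Step input applied at the first sample (t = times[0])
    - System starts at steady state
    - Response has settled by the last sample

    Returns (K, tau); tau is None if the 63.2% level is never reached.
    """
    # Initial and steady-state values
    y_initial = outputs[0]
    y_final = outputs[-1]  # Assuming response has settled

    # Gain
    K = (y_final - y_initial) / step_amplitude

    # Time constant: time to reach 63.2% of the total change
    y_63 = y_initial + 0.632 * (y_final - y_initial)
    rising = y_final >= y_initial  # also handle downward steps
    tau = None
    for t, y in zip(times, outputs):
        if (y >= y_63) if rising else (y <= y_63):
            tau = t - times[0]
            break

    return K, tau
```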
Motor System Identification
Step Response Experiment:
1. Robot stationary
2. Apply PWM = 100 suddenly
3. Measure speed over time
4. Fit first-order model
Result:
- K = 0.45 cm/s per PWM unit
- tau = 150 ms (motor response time)
Use for:
- Predicting response to commands
- Feed-forward control design
- Simulation
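As a rough illustration of how the identified numbers get used, the sketch below inverts the steady-state gain to pick a feed-forward PWM command and estimates how long the motor takes to settle. It uses the K and tau values quoted above; the 0 to 255 PWM range is an assumption, and the model ignores dead zone and saturation.

```python
import math

K = 0.45     # cm/s per PWM unit (from the step response result)
TAU = 0.150  # seconds


def feedforward_pwm(target_speed, pwm_max=255):
    """PWM that gives target_speed at steady state, clamped to [0, pwm_max].

    pwm_max=255 is an assumed limit; use your driver's actual range.
    """
    pwm = target_speed / K
    return max(0.0, min(float(pwm_max), pwm))


def settling_time(fraction=0.95):
    """Time for a first-order step response to cover `fraction` of its change."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return -TAU * math.log(1 - fraction)


print("PWM for 30 cm/s:", feedforward_pwm(30))
print("Time to reach 95%:", settling_time(0.95), "s")
```

A feedback controller can then correct whatever error the feed-forward term leaves behind.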
4. Practical Calibration Strategies
Multi-Point Calibration
For non-linear sensors, use lookup tables:
```python
# Calibration points: (ADC value, distance in cm), sorted by ADC value
cal_points = [
    (100, 0),     # ADC 100 = 0 cm
    (200, 5.2),   # ADC 200 = 5.2 cm
    (350, 10.1),  # ADC 350 = 10.1 cm
    (500, 15.3),  # ADC 500 = 15.3 cm
    (700, 20.0),  # ADC 700 = 20.0 cm
]

def calibrated_reading(adc_value):
    """Convert ADC to physical units using linear interpolation.

    Returns None if adc_value is outside the calibrated range.
    """
    for i in range(len(cal_points) - 1):
        adc_low, val_low = cal_points[i]
        adc_high, val_high = cal_points[i + 1]

        if adc_low <= adc_value <= adc_high:
            # Linear interpolation
            fraction = (adc_value - adc_low) / (adc_high - adc_low)
            return val_low + fraction * (val_high - val_low)

    # Outside the calibrated range: refuse to extrapolate
    return None
```
Temperature Compensation
Many sensors drift with temperature:
```python
# Measure at two temperatures
# temp1 = 20°C, reading1 = 512
# temp2 = 40°C, reading2 = 525
temp_coefficient = (525 - 512) / (40 - 20)  # = 0.65 counts per °C

def compensated_reading(raw_reading, current_temp, reference_temp=25):
    """Remove temperature drift from reading."""
    temp_offset = (current_temp - reference_temp) * temp_coefficient
    return raw_reading - temp_offset
```
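A quick sanity check for any compensation scheme is to feed the calibration measurements back through it: both should map to the same compensated value. The sketch below does this with the two readings from the comments above; `fit_temp_coefficient` and `compensate` are illustrative names, and the drift is assumed to be linear in temperature.

```python
def fit_temp_coefficient(temp1, reading1, temp2, reading2):
    """Drift in counts per °C from two measurements of the same quantity."""
    return (reading2 - reading1) / (temp2 - temp1)


def compensate(raw_reading, current_temp, coefficient, reference_temp=25):
    """Refer a raw reading back to the reference temperature."""
    return raw_reading - (current_temp - reference_temp) * coefficient


coefficient = fit_temp_coefficient(20, 512, 40, 525)
cold = compensate(512, 20, coefficient)
hot = compensate(525, 40, coefficient)
print("coefficient:", coefficient, "cold:", cold, "hot:", hot)
```

Both readings land at about 515.25 counts, the value the sensor would report at the 25 °C reference. If a third measurement at another temperature does not land near the same value, the drift is probably not linear.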
Batch vs Online Calibration
| Approach | When to Use |
|---|---|
| Batch (one-time) | Factory calibration, stable environments |
| Online (continuous) | Adapting to changing conditions |
```python
# Online calibration with exponential moving average
class OnlineCalibrator:
    def __init__(self, alpha=0.01):
        self.alpha = alpha  # Learning rate (0 < alpha <= 1)
        self.offset = 0.0

    def update(self, measured, expected):
        """Update calibration based on ground truth.

        measured must be the calibrated value, i.e. apply(raw_value);
        the offset then tracks a moving average of the raw sensor error.
        """
        error = expected - measured
        self.offset += self.alpha * error

    def apply(self, raw_value):
        """Apply current calibration."""
        return raw_value + self.offset
```
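The learning rate sets how quickly the offset follows a change. If the sensor has a constant bias and `update` is called with the calibrated reading, the offset after \(n\) updates is \(\text{bias} \cdot \left(1 - (1 - \alpha)^n\right)\). The sketch below, with illustrative function names, computes how many updates are needed to close a given fraction of the gap and checks it by simulation.

```python
import math


def updates_to_converge(alpha, fraction=0.95):
    """Number of updates until the offset covers `fraction` of a constant bias."""
    if not 0 < alpha < 1 or not 0 < fraction < 1:
        raise ValueError("alpha and fraction must be strictly between 0 and 1")
    return math.ceil(math.log(1 - fraction) / math.log(1 - alpha))


def simulate_offset(bias, alpha, n_updates):
    """Offset after n_updates when the raw sensor reads `bias` too low."""
    offset = 0.0
    for _ in range(n_updates):
        error = bias - offset  # expected - (raw + offset), with raw = expected - bias
        offset += alpha * error
    return offset


n = updates_to_converge(0.01)
print("updates needed:", n, "offset reached:", simulate_offset(5.0, 0.01, n))
```

With the default `alpha=0.01` this comes to roughly 300 updates, so a smaller alpha buys noise rejection at the cost of slower adaptation.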
5. Experiments
Experiment 1: Regression Quality Study
Objective: Understand how data quantity affects model quality.
Procedure:
1. Collect 50+ calibration points
2. Train models with 5, 10, 20, 50 points
3. Evaluate all on same test set
4. Plot R² vs training set size
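The procedure above can be sketched as a small learning-curve script. The data here is synthetic (a straight line plus Gaussian noise with made-up parameters) and only stands in for your own measurements; `fit_line`, `r_squared`, `learning_curve` and `make_synthetic_data` are illustrative names.

```python
import random


def fit_line(xs, ys):
    """Least-squares line; returns (m, b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = s_xy / s_xx
    return m, y_mean - m * x_mean


def r_squared(model, xs, ys):
    """R² of the line `model` on the given data."""
    m, b = model
    y_mean = sum(ys) / len(ys)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0


def learning_curve(x_train, y_train, x_test, y_test, sizes=(5, 10, 20, 50)):
    """Fit on the first `size` training points, score on the fixed test set.

    Assumes the training data is already in random order.
    """
    results = []
    for size in sizes:
        if size > len(x_train):
            break
        model = fit_line(x_train[:size], y_train[:size])
        results.append((size, r_squared(model, x_test, y_test)))
    return results


def make_synthetic_data(n, seed=1):
    """Synthetic PWM/speed pairs: speed = 0.45 * (pwm - 35) + noise (made up)."""
    rng = random.Random(seed)
    xs = [rng.uniform(50, 150) for _ in range(n)]
    ys = [0.45 * (x - 35) + rng.gauss(0, 2.0) for x in xs]
    return xs, ys


xs, ys = make_synthetic_data(70)
for size, score in learning_curve(xs[:50], ys[:50], xs[50:], ys[50:]):
    print(size, "training points -> test R² =", round(score, 3))
```

Repeating the run with several seeds (or several shuffles of real data) shows how much the small-sample results scatter.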
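Expected Result: Test R² typically climbs quickly over the first few training-set sizes and then levels off as more data is added.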
Experiment 2: Model Comparison
Objective: Compare linear vs polynomial for motor model.
Procedure:
1. Collect PWM vs speed data (include dead zone region)
2. Fit linear model: speed = a + b*PWM
3. Fit piecewise linear: speed = 0 if PWM < deadzone, else linear
4. Fit quadratic: speed = a + b*PWM + c*PWM²
5. Compare R² and RMSE
Experiment 3: Cross-Surface Generalization
Objective: Test if model trained on one surface works on another.
Procedure:
1. Train motor model on smooth tile
2. Evaluate on: tile, carpet, wood
3. Measure degradation
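One way to quantify "degradation" in step 3 is to report each surface's RMSE alongside its ratio to the RMSE on the training surface. The sketch below assumes a linear model `(m, b)` and a dictionary of `(pwm, speed)` samples per surface; the names and the data layout are illustrative choices, not something Lab 07 prescribes.

```python
import math


def model_rmse(model, samples):
    """RMSE of a linear model (m, b) on a list of (pwm, speed) samples."""
    m, b = model
    squared = [(speed - (m * pwm + b)) ** 2 for pwm, speed in samples]
    return math.sqrt(sum(squared) / len(squared))


def surface_report(model, data_by_surface, baseline):
    """Map surface name -> (rmse, rmse relative to the baseline surface).

    The ratio is None when the baseline RMSE is zero.
    """
    base = model_rmse(model, data_by_surface[baseline])
    report = {}
    for surface, samples in data_by_surface.items():
        err = model_rmse(model, samples)
        report[surface] = (err, err / base if base > 0 else None)
    return report
```

A ratio well above 1 on carpet or wood is the signal that surface-specific calibration is worth the effort.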
Key Question: Do you need surface-specific calibration?
6. Common Pitfalls
Correlation vs Causation
Pitfall: "Vibration causes speed"
Reality: Speed causes vibration
Model still works for estimation, but:
- Can't control speed by controlling vibration
- Physical understanding matters for debugging
Extrapolation Danger
Training data: PWM 50-150
Model learned: speed = 0.45 * (PWM - 35)
At PWM = 200:
Model predicts: 74 cm/s
Reality: Motor saturates at 50 cm/s!
Rule: Never trust predictions outside training range.
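A simple guard against silent extrapolation is to store the training range with the model and flag any query outside it. The sketch below uses the numbers from the example above; the class name `RangeGuardedModel` is illustrative, and returning a flag rather than raising an error is just one possible design.

```python
class RangeGuardedModel:
    """Linear model that remembers the input range it was trained on."""

    def __init__(self, m, b, x_min, x_max):
        self.m = m
        self.b = b
        self.x_min = x_min
        self.x_max = x_max

    def predict(self, x):
        """Return (prediction, in_range); in_range is False when extrapolating."""
        in_range = self.x_min <= x <= self.x_max
        return self.m * x + self.b, in_range


# speed = 0.45 * (PWM - 35), trained on PWM 50-150
motor_model = RangeGuardedModel(m=0.45, b=-0.45 * 35, x_min=50, x_max=150)

for pwm in (100, 200):
    speed, trusted = motor_model.predict(pwm)
    note = "" if trusted else "  (outside training range, do not trust)"
    print(f"PWM {pwm}: {speed:.1f} cm/s{note}")
```

The caller decides what to do with an untrusted prediction: clamp it, fall back to another sensor, or schedule more data collection in that range.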
Data Collection Bias
Problem: Only collected data going straight
Result: Model fails on curves (different wheel slip)
Solution: Collect data across operating conditions:
- Different speeds
- Straight and curved paths
- Multiple battery levels
- Different surfaces
7. Mini-Project: Adaptive Speed Estimation
Goal: Build a speed estimator that adapts to conditions.
Requirements:
1. Initial calibration from ground truth track
2. Fuse PWM command and IMU vibration
3. Detect when model is degraded (high prediction error)
4. Trigger recalibration when needed
Architecture:
┌──────────────────────────────────────────────────────┐
│ ADAPTIVE ESTIMATOR │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ PWM │───►│ Motor │───►│ │ │
│ │ Command │ │ Model │ │ Weighted │───►│ Speed
│ └──────────┘ └──────────┘ │ Fusion │ │ Estimate
│ │ │ │
│ ┌──────────┐ ┌──────────┐ │ │ │
│ │ IMU │───►│ Vibration│───►│ │ │
│ │ Reading │ │ Model │ └──────────────┘ │
│ └──────────┘ └──────────┘ ▲ │
│ │ │
│ ┌──────────────────────────────────────┴──┐ │
│ │ Confidence Estimator │ │
│ │ (detect degraded predictions) │ │
│ └──────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
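As a starting point for the fusion and confidence blocks in the diagram, the sketch below weights each model's estimate by the inverse of its running squared error and uses the disagreement between the two models as a degradation signal when no ground truth is available. Everything here is an assumption made for illustration: the class name, the inverse-error weighting, and the default `alpha` and `disagreement_limit` values.

```python
class FusedSpeedEstimator:
    """Inverse-error weighted fusion of two speed estimates (illustrative sketch)."""

    def __init__(self, alpha=0.1, disagreement_limit=5.0):
        self.alpha = alpha                            # smoothing factor for running stats
        self.disagreement_limit = disagreement_limit  # cm/s, hypothetical threshold
        self.mse = {"motor": 1.0, "vibration": 1.0}   # running squared error per model
        self.disagreement = 0.0                       # smoothed |motor - vibration|

    def estimate(self, motor_speed, vib_speed):
        """Fuse the two model outputs; the lower-error model gets more weight."""
        w_motor = 1.0 / self.mse["motor"]
        w_vib = 1.0 / self.mse["vibration"]
        fused = (w_motor * motor_speed + w_vib * vib_speed) / (w_motor + w_vib)
        gap = abs(motor_speed - vib_speed)
        self.disagreement += self.alpha * (gap - self.disagreement)
        return fused

    def update_with_truth(self, motor_speed, vib_speed, true_speed):
        """Refresh each model's running squared error when ground truth is available."""
        for name, predicted in (("motor", motor_speed), ("vibration", vib_speed)):
            squared_error = (predicted - true_speed) ** 2
            updated = self.mse[name] + self.alpha * (squared_error - self.mse[name])
            self.mse[name] = max(updated, 1e-6)  # floor avoids division by zero

    def needs_recalibration(self):
        """True when the two models have disagreed persistently."""
        return self.disagreement > self.disagreement_limit
```

Disagreement only shows that at least one model has drifted, not which one, so a recalibration run against the ground truth track is still needed to reset the weights.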
8. Further Reading
Textbooks
- System Identification: Theory for the User - Ljung (classic reference)
- Data-Driven Science and Engineering - Brunton & Kutz (modern, practical)
Online Resources
- 3Blue1Brown: Linear Regression - Visual intuition
- Khan Academy: Regression - Interactive exercises
Related Courses
- ES102: Advanced system identification techniques
- Control Theory: Model-based control design
Summary
| Concept | Key Takeaway |
|---|---|
| Regression | Finds relationship from data, not physics |
| R² | Fraction of variation explained (1 is perfect; can go negative on test data) |
| RMSE | Typical prediction error in physical units |
| Cross-validation | Always test on held-out data |
| System ID | Build dynamic models from step response |
| Calibration | Multi-point, temperature compensation |
The Engineering Method:
1. Collect data systematically
2. Fit simple model first
3. Evaluate on test data
4. Understand failure modes
5. Deploy with known limitations