
Data-Driven Methods

Advanced Track

This module extends Lab 07: Data Analysis with deeper theory on regression, system identification, and model validation.


Prerequisites

  • Completed Lab 07 (basic data collection and model fitting)
  • Understanding of linear relationships
  • Basic statistics (mean, standard deviation)

1. Regression Theory

Why Regression?

In embedded systems, we often need to predict one quantity from another:

| What We Measure | What We Want | Example |
| --- | --- | --- |
| PWM command | Actual speed | Motor model |
| IMU vibration | Robot velocity | Speed estimation |
| Temperature | Resistance | Thermistor calibration |
| ADC value | Physical unit | Sensor linearization |

Regression finds the mathematical relationship between these quantities.

Linear Regression Derivation

Given data points \((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\), we want to find the line \(y = mx + b\) that minimizes the error.

Least Squares Objective:

\[E = \sum_{i=1}^{n} (y_i - (mx_i + b))^2\]

Taking partial derivatives and setting to zero:

\[\frac{\partial E}{\partial m} = 0 \implies m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2}\]
\[\frac{\partial E}{\partial b} = 0 \implies b = \bar{y} - m\bar{x}\]

Implementation

def linear_regression(x_data, y_data):
    """
    Fit y = mx + b using least squares.

    Returns:
        m: slope
        b: intercept
        r_squared: coefficient of determination
    """
    n = len(x_data)

    # Sums
    sum_x = sum(x_data)
    sum_y = sum(y_data)
    sum_xy = sum(x * y for x, y in zip(x_data, y_data))
    sum_x2 = sum(x * x for x in x_data)

    # Slope and intercept
    denominator = n * sum_x2 - sum_x * sum_x
    if denominator == 0:
        raise ValueError("x_data has no variance; cannot fit a line")
    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n

    # R-squared
    y_mean = sum_y / n
    ss_tot = sum((y - y_mean) ** 2 for y in y_data)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(x_data, y_data))
    r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else 0

    return m, b, r_squared

Polynomial Regression

For non-linear relationships, extend to polynomials:

\[y = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n\]
def polynomial_features(x, degree):
    """Generate polynomial features [1, x, x², ..., x^n]."""
    return [x ** i for i in range(degree + 1)]

# Example: Quadratic fit for motor dead zone
# speed = a + b*pwm + c*pwm²

When to use polynomial:

- Motor dead zone (speed vs PWM is not linear near zero)
- Sensor non-linearity
- Temperature compensation curves
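The features above can be turned into an actual fit by solving the normal equations \(A^T A \, a = A^T y\). This is a minimal pure-Python sketch (a library routine such as `numpy.polyfit` would normally do this more robustly):

```python
def polyfit_normal_equations(x_data, y_data, degree):
    """Least-squares polynomial fit: solve (A^T A) a = A^T y directly."""
    n_coef = degree + 1
    # Normal-equation matrix: M[i][j] = sum of x^(i+j)
    M = [[sum(x ** (i + j) for x in x_data) for j in range(n_coef)]
         for i in range(n_coef)]
    # Right-hand side: v[i] = sum of y * x^i
    v = [sum(y * x ** i for x, y in zip(x_data, y_data)) for i in range(n_coef)]

    # Gaussian elimination with partial pivoting
    for col in range(n_coef):
        pivot = max(range(col, n_coef), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        v[col], v[pivot] = v[pivot], v[col]
        for row in range(col + 1, n_coef):
            factor = M[row][col] / M[col][col]
            for c in range(col, n_coef):
                M[row][c] -= factor * M[col][c]
            v[row] -= factor * v[col]

    # Back substitution
    coeffs = [0.0] * n_coef
    for i in range(n_coef - 1, -1, -1):
        s = sum(M[i][j] * coeffs[j] for j in range(i + 1, n_coef))
        coeffs[i] = (v[i] - s) / M[i][i]
    return coeffs  # [a0, a1, ..., a_degree], lowest order first

# Example: recover speed = 2 + 3*pwm + 0.5*pwm² from exact samples
pwm = [0, 1, 2, 3, 4]
speed = [2 + 3 * p + 0.5 * p ** 2 for p in pwm]
a0, a1, a2 = polyfit_normal_equations(pwm, speed, degree=2)
```

Note that the normal equations become numerically ill-conditioned for high degrees; for the low-degree fits used here (quadratic, cubic) the direct solve is fine.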


2. Model Evaluation Metrics

R² (Coefficient of Determination)

\[R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\]

Interpretation:

| R² Value | Meaning | Action |
| --- | --- | --- |
| 0.95-1.0 | Excellent | Deploy with confidence |
| 0.85-0.95 | Good | Acceptable for most applications |
| 0.70-0.85 | Moderate | Consider additional features |
| < 0.70 | Poor | Model doesn't capture the relationship |

Caution: High R² doesn't mean good predictions on new data!

RMSE (Root Mean Square Error)

\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\]

Physical interpretation: RMSE has the same units as your output. If predicting speed in cm/s and RMSE = 3.5, predictions are typically off by ±3.5 cm/s.
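As a quick check, RMSE can be computed directly from the formula above:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the output."""
    n = len(y_true)
    return math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n)

# Predictions off by 3 and 4 units give RMSE = sqrt((9 + 16) / 2) ≈ 3.54
print(round(rmse([10, 20], [13, 24]), 2))  # → 3.54
```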

Cross-Validation

Never evaluate on training data alone!

def train_test_split(x_data, y_data, test_fraction=0.2):
    """
    Split data into training and testing sets.

    Use training set to fit model.
    Use test set to evaluate - this is your real performance!
    """
    n = len(x_data)
    n_test = int(n * test_fraction)

    # Shuffle indices (simple random approach)
    import random
    indices = list(range(n))
    random.shuffle(indices)

    test_idx = indices[:n_test]
    train_idx = indices[n_test:]

    x_train = [x_data[i] for i in train_idx]
    y_train = [y_data[i] for i in train_idx]
    x_test = [x_data[i] for i in test_idx]
    y_test = [y_data[i] for i in test_idx]

    return x_train, y_train, x_test, y_test

Overfitting Warning

Symptoms of overfitting:
- Training R² = 0.98, Test R² = 0.60  ← Big gap!
- Model fits training data perfectly but fails on new data
- Using too many polynomial terms for small dataset

Prevention:
- Use simpler models (start with linear)
- Collect more data
- Always evaluate on held-out test set

3. System Identification

System identification (sysid) is the process of building mathematical models of dynamic systems from measured data.

First-Order System

Many physical systems can be modeled as first-order:

\[\tau \frac{dy}{dt} + y = K \cdot u\]

Where:

- \(y\) = output (speed, temperature, etc.)
- \(u\) = input (PWM, voltage, etc.)
- \(K\) = steady-state gain
- \(\tau\) = time constant
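To see what this model predicts, here is a minimal sketch of the analytic step response \(y(t) = K u (1 - e^{-t/\tau})\) for a system starting at rest; the motor values used are hypothetical, chosen to match the example later in this section:

```python
import math

K, tau, u = 0.45, 0.150, 100            # hypothetical: cm/s per PWM unit, seconds, PWM step
times = [i * 0.01 for i in range(101)]  # 0 to 1 s at 10 ms resolution
speeds = [K * u * (1 - math.exp(-t / tau)) for t in times]

# At t = tau, the response has covered 63.2% of the distance to its final value
final = K * u
at_tau = K * u * (1 - math.exp(-1))
print(round(at_tau / final, 3))  # → 0.632
```

This 63.2% property is exactly what the step-response identification method below exploits.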

Step Response Method

Apply a step input and measure the response:

def identify_first_order(times, outputs, step_amplitude):
    """
    Identify K and tau from step response data.

    Assumes:
    - Step input applied at t=0
    - System starts at steady state
    """
    # Find steady-state value
    y_final = outputs[-1]  # Assuming response has settled
    y_initial = outputs[0]

    # Gain
    K = (y_final - y_initial) / step_amplitude

    # Time constant: time to reach 63.2% of the change toward the final value
    tau = None  # stays None if the response never crosses the threshold
    for t, y in zip(times, outputs):
        if abs(y - y_initial) >= 0.632 * abs(y_final - y_initial):
            tau = t
            break

    return K, tau

Motor System Identification

Step Response Experiment:
1. Robot stationary
2. Apply PWM = 100 suddenly
3. Measure speed over time
4. Fit first-order model

Result:
- K = 0.45 cm/s per PWM unit
- tau = 150 ms (motor response time)

Use for:
- Predicting response to commands
- Feed-forward control design
- Simulation
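As a sketch of the feed-forward idea, the steady-state gain can simply be inverted to pick a command for a desired speed (the 0-255 PWM range is an assumption, not from the lab hardware spec):

```python
K = 0.45  # cm/s per PWM unit, from the step-response fit above

def feedforward_pwm(target_speed_cm_s):
    """Invert the steady-state motor model to choose a PWM command."""
    pwm = target_speed_cm_s / K
    return max(0, min(255, pwm))  # clamp to an assumed 0-255 PWM range

print(feedforward_pwm(20))  # 20 cm/s requires about 44.4 PWM units
```

This ignores the transient (the \(\tau\) = 150 ms lag) and the dead zone; in practice feed-forward like this is combined with feedback to correct the residual error.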

4. Practical Calibration Strategies

Multi-Point Calibration

For non-linear sensors, use lookup tables:

# Calibration points
cal_points = [
    (100, 0),     # ADC 100 = 0 cm
    (200, 5.2),   # ADC 200 = 5.2 cm
    (350, 10.1),  # ADC 350 = 10.1 cm
    (500, 15.3),  # ADC 500 = 15.3 cm
    (700, 20.0),  # ADC 700 = 20.0 cm
]

def calibrated_reading(adc_value):
    """Convert ADC to physical units using interpolation."""
    for i in range(len(cal_points) - 1):
        adc_low, val_low = cal_points[i]
        adc_high, val_high = cal_points[i + 1]

        if adc_low <= adc_value <= adc_high:
            # Linear interpolation
            fraction = (adc_value - adc_low) / (adc_high - adc_low)
            return val_low + fraction * (val_high - val_low)

    # Outside the calibrated range: refuse to extrapolate
    return None

Temperature Compensation

Many sensors drift with temperature:

# Measure at two temperatures
# temp1 = 20°C, reading1 = 512
# temp2 = 40°C, reading2 = 525

temp_coefficient = (525 - 512) / (40 - 20)  # = 0.65 per °C

def compensated_reading(raw_reading, current_temp, reference_temp=25):
    """Remove temperature drift from reading."""
    temp_offset = (current_temp - reference_temp) * temp_coefficient
    return raw_reading - temp_offset

Batch vs Online Calibration

| Approach | When to Use |
| --- | --- |
| Batch (one-time) | Factory calibration, stable environments |
| Online (continuous) | Adapting to changing conditions |
# Online calibration with exponential moving average
class OnlineCalibrator:
    def __init__(self, alpha=0.01):
        self.alpha = alpha  # Learning rate
        self.offset = 0

    def update(self, measured, expected):
        """Update calibration based on ground truth."""
        # Take the error of the *calibrated* value so the offset converges,
        # rather than drifting without bound under a constant bias
        error = expected - self.apply(measured)
        self.offset += self.alpha * error

    def apply(self, raw_value):
        """Apply current calibration."""
        return raw_value + self.offset
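A quick demonstration of the online calibrator (restated compactly here so the snippet runs standalone; note the update error is taken on the calibrated value so the offset settles instead of drifting): given a sensor with a constant +5 bias, the learned offset converges to -5.

```python
class OnlineCalibrator:
    """Compact restatement of the class above for a standalone demo."""
    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.offset = 0.0

    def update(self, measured, expected):
        self.offset += self.alpha * (expected - self.apply(measured))

    def apply(self, raw_value):
        return raw_value + self.offset

# Sensor always reads 105 when the true value is 100 (+5 bias)
cal = OnlineCalibrator(alpha=0.05)
for _ in range(200):
    cal.update(measured=105, expected=100)
print(round(cal.offset, 2))  # → -5.0
```

A smaller `alpha` converges more slowly but is less sensitive to noise in the ground-truth samples.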

5. Experiments

Experiment 1: Regression Quality Study

Objective: Understand how data quantity affects model quality.

Procedure:

1. Collect 50+ calibration points
2. Train models with 5, 10, 20, 50 points
3. Evaluate all on the same test set
4. Plot R² vs training set size

Expected Result:

| Training Points | Test R² |
| --- | --- |
| 5 | 0.65 |
| 10 | 0.78 |
| 20 | 0.88 |
| 50 | 0.93 |

Experiment 2: Model Comparison

Objective: Compare linear vs polynomial for motor model.

Procedure:

1. Collect PWM vs speed data (include the dead zone region)
2. Fit linear model: speed = a + b*PWM
3. Fit piecewise linear: speed = 0 if PWM < deadzone, else linear
4. Fit quadratic: speed = a + b*PWM + c*PWM²
5. Compare R² and RMSE

Experiment 3: Cross-Surface Generalization

Objective: Test if model trained on one surface works on another.

Procedure:

1. Train motor model on smooth tile
2. Evaluate on: tile, carpet, wood
3. Measure degradation

Key Question: Do you need surface-specific calibration?


6. Common Pitfalls

Correlation vs Causation

Pitfall: "Vibration causes speed"
Reality: Speed causes vibration

Model still works for estimation, but:
- Can't control speed by controlling vibration
- Physical understanding matters for debugging

Extrapolation Danger

Training data: PWM 50-150
Model learned: speed = 0.45 * (PWM - 35)

At PWM = 200:
Model predicts: 74 cm/s
Reality: Motor saturates at 50 cm/s!

Rule: Never trust predictions outside training range.
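One defensive pattern is to record the training range with the model and refuse to predict outside it. A minimal sketch, using the example model above (slope 0.45, offset 35, trained on PWM 50-150):

```python
X_MIN, X_MAX = 50, 150  # PWM range covered by the training data

def safe_predict(pwm, m=0.45, b=-15.75):
    """Predict speed, but refuse to extrapolate beyond the training range."""
    if not (X_MIN <= pwm <= X_MAX):
        return None  # caller must handle out-of-range inputs explicitly
    return m * pwm + b

print(safe_predict(100))  # → 29.25 (inside the training range)
print(safe_predict(200))  # → None  (would require extrapolation)
```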

Data Collection Bias

Problem: Only collected data going straight
Result: Model fails on curves (different wheel slip)

Solution: Collect data across operating conditions:
- Different speeds
- Straight and curved paths
- Multiple battery levels
- Different surfaces

7. Mini-Project: Adaptive Speed Estimation

Goal: Build a speed estimator that adapts to conditions.

Requirements:

1. Initial calibration from ground truth track
2. Fuse PWM command and IMU vibration
3. Detect when the model is degraded (high prediction error)
4. Trigger recalibration when needed

Architecture:

┌──────────────────────────────────────────────────────┐
│                  ADAPTIVE ESTIMATOR                    │
│                                                        │
│  ┌──────────┐    ┌──────────┐    ┌──────────────┐    │
│  │   PWM    │───►│  Motor   │───►│              │    │
│  │ Command  │    │  Model   │    │   Weighted   │───►│ Speed
│  └──────────┘    └──────────┘    │   Fusion     │    │ Estimate
│                                  │              │    │
│  ┌──────────┐    ┌──────────┐    │              │    │
│  │   IMU    │───►│ Vibration│───►│              │    │
│  │ Reading  │    │  Model   │    └──────────────┘    │
│  └──────────┘    └──────────┘           ▲            │
│                                         │            │
│  ┌──────────────────────────────────────┴──┐         │
│  │        Confidence Estimator             │         │
│  │   (detect degraded predictions)          │         │
│  └──────────────────────────────────────────┘         │
└──────────────────────────────────────────────────────┘


8. Further Reading

Textbooks

  • System Identification: Theory for the User - Ljung (classic reference)
  • Data-Driven Science and Engineering - Brunton & Kutz (modern, practical)

Online Resources

  • ES102: Advanced system identification techniques
  • Control Theory: Model-based control design

Summary

| Concept | Key Takeaway |
| --- | --- |
| Regression | Finds relationship from data, not physics |
| R² | How much variation is explained (0-1) |
| RMSE | Average prediction error in physical units |
| Cross-validation | Always test on held-out data |
| System ID | Build dynamic models from step response |
| Calibration | Multi-point, temperature compensation |

The Engineering Method:

1. Collect data systematically
2. Fit a simple model first
3. Evaluate on test data
4. Understand failure modes
5. Deploy with known limitations