Data Science Unlocked: A Practical Guide for Modern Analysts – Chapter 5
Published 2026-02-23 16:39
# Chapter 5: Supervised Learning in Practice
> **Goal** – Translate the patterns discovered during EDA into predictive models that provide actionable business insights. This chapter walks through regression, classification, validation, and hyper‑parameter tuning using the most popular Python libraries.
## 5.1 From Insight to Prediction
- **Define the objective**: *What* you are trying to predict and *why* it matters.
- **Select the target**: Continuous for regression, categorical for classification.
- **Align with stakeholders**: Ensure the model’s output can be interpreted and acted upon.
| Task | Typical Target | Common Use‑Cases |
|------|----------------|-------------------|
| Regression | `float` | House price, sales forecast, risk score |
| Classification | `int`/`str` | Customer churn, fraud detection, disease diagnosis |
## 5.2 Project Setup & Data Pipeline
```python
# Imports – keep dependencies minimal for reproducibility
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the pre-cleaned data from the previous chapter
df = pd.read_csv('data/cleaned_dataset.csv')

# Basic sanity checks – missing values and summary statistics
print(df.isna().sum())
print(df.describe().T)
```
**Tip**: Store the split indices (`train_idx`, `test_idx`) in a JSON file so that the same split can be reused for experiments and production.
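The tip above can be sketched as follows. The file name `split_indices.json` and the synthetic feature matrix are illustrative assumptions; in practice you would split the indices of your actual `df`:

```python
import json
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix (assumption for the sketch)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
idx = np.arange(len(X))

# Split the row indices rather than the data itself
train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=42)

# Persist the split so every experiment reuses the same partition
with open("split_indices.json", "w") as f:
    json.dump({"train_idx": train_idx.tolist(), "test_idx": test_idx.tolist()}, f)

# Later: reload and reconstruct exactly the same split
with open("split_indices.json") as f:
    saved = json.load(f)
X_train, X_test = X[saved["train_idx"]], X[saved["test_idx"]]
```

Because the indices (not the rows) are stored, the same JSON file works for any derived feature set built on the same underlying rows.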
## 5.3 Regression Fundamentals
### 5.3.1 Linear Regression
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = df.drop(columns=["price"])
y = df["price"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_val)
print("RMSE: {:.2f}".format(np.sqrt(mean_squared_error(y_val, pred))))
print("R²: {:.2f}".format(r2_score(y_val, pred)))
```
> **Interpretation** – The coefficient of each feature tells you how a one‑unit increase in that feature shifts the predicted target, assuming all other features are held constant.
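To make that interpretation concrete, here is a minimal sketch on synthetic data with a known relationship (the feature names `area` and `rooms` are hypothetical); pairing each fitted coefficient with its feature name recovers the generating slopes:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: target ≈ 3*area + 2*rooms
rng = np.random.default_rng(0)
X = pd.DataFrame({"area": rng.normal(size=200), "rooms": rng.normal(size=200)})
y = 3 * X["area"] + 2 * X["rooms"] + rng.normal(scale=0.1, size=200)

lr = LinearRegression().fit(X, y)

# Pair each coefficient with its feature name, largest effect first
coefs = pd.Series(lr.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)
```

On real data, remember that this "one-unit shift, all else equal" reading only holds to the extent that features are not strongly collinear.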
### 5.3.2 Regularized Regression (Ridge & Lasso)
```python
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)
```
- **Ridge** shrinks coefficients but keeps all features.
- **Lasso** performs feature selection by driving some coefficients to zero.
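A small sketch, on synthetic data, of the behavioural difference described above: only the first two of ten features carry signal, and Lasso zeroes out most of the noise features while Ridge merely shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: only features 0 and 1 carry signal, the other 8 are noise
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=300)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge keeps every coefficient non-zero; Lasso drives noise features to 0
print("Ridge non-zero coefs:", int(np.sum(ridge.coef_ != 0)))
print("Lasso non-zero coefs:", int(np.sum(lasso.coef_ != 0)))
```

The choice of `alpha` controls how aggressively Lasso prunes; it should be tuned with cross-validation rather than fixed as here.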
## 5.4 Classification Essentials
### 5.4.1 Logistic Regression (Baseline)
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

X = df.drop(columns=["churn"])
y = df["churn"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
pred_proba = logreg.predict_proba(X_val)[:, 1]
print("ROC‑AUC: {:.3f}".format(roc_auc_score(y_val, pred_proba)))
print(classification_report(y_val, logreg.predict(X_val)))
```
### 5.4.2 Tree‑Based Ensembles – XGBoost & LightGBM
```python
import xgboost as xgb
import lightgbm as lgb

# XGBoost
xgb_model = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4, subsample=0.8, colsample_bytree=0.8)
xgb_model.fit(X_train, y_train)
print("XGBoost ROC‑AUC: {:.3f}".format(roc_auc_score(y_val, xgb_model.predict_proba(X_val)[:, 1])))

# LightGBM
lgb_model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05, max_depth=4, subsample=0.8, colsample_bytree=0.8)
lgb_model.fit(X_train, y_train)
print("LightGBM ROC‑AUC: {:.3f}".format(roc_auc_score(y_val, lgb_model.predict_proba(X_val)[:, 1])))
```
> **Why ensembles?** They capture non‑linear interactions and often outperform single‑tree or linear baselines on tabular data.
## 5.5 Cross‑Validation Strategies
| Scenario | Recommended CV | Key Parameters |
|----------|----------------|----------------|
| Small dataset | KFold (k=5–10) | `shuffle=True`, `random_state` |
| Classification with imbalance | StratifiedKFold | `n_splits`, `shuffle` |
| Time‑series | TimeSeriesSplit | `n_splits` |
```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(lgb_model, X, y, cv=skf, scoring='roc_auc')
print("CV ROC‑AUC: {:.3f} ± {:.3f}".format(cv_scores.mean(), cv_scores.std()))
```
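The table above recommends `TimeSeriesSplit` for temporal data; here is a minimal sketch (the 30-point series is an illustrative stand-in) showing that each fold validates strictly on the future, so no future information leaks into training:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily series: index order is time order
X = np.arange(30).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=4)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Every validation index is strictly later than every training index
    assert train_idx.max() < val_idx.min()
    print(f"fold {fold}: train up to t={train_idx.max()}, "
          f"validate t={val_idx.min()}..{val_idx.max()}")
```

This is why `shuffle=True` must never be used with time-series data: shuffling would train on the future and report optimistically biased scores.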
## 5.6 Hyper‑Parameter Tuning
### 5.6.1 Grid Search (exhaustive)
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [200, 400],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5, 7]
}
grid = GridSearchCV(lgb_model, param_grid, cv=skf, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best ROC‑AUC:", grid.best_score_)
```
### 5.6.2 Randomized Search (sampling)
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

param_dist = {
    'n_estimators': [100, 200, 400, 600],
    'learning_rate': uniform(0.01, 0.2),
    'max_depth': [3, 5, 7, 9]
}
rand = RandomizedSearchCV(lgb_model, param_dist, n_iter=50, cv=skf, scoring='roc_auc', n_jobs=-1, random_state=42)
rand.fit(X_train, y_train)
print("Best params:", rand.best_params_)
print("Best ROC‑AUC:", rand.best_score_)
```
### 5.6.3 Bayesian Optimization (Optuna)
```python
import optuna
from optuna.integration import LightGBMPruningCallback

def objective(trial):
    param = {
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 9),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': -1
    }
    n_estimators = trial.suggest_int('n_estimators', 200, 600)
    dtrain = lgb.Dataset(X_train, label=y_train)
    dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)
    booster = lgb.train(
        param,
        dtrain,
        num_boost_round=n_estimators,
        valid_sets=[dval],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50),
            lgb.log_evaluation(0),
            LightGBMPruningCallback(trial, "auc"),
        ],
    )
    preds = booster.predict(X_val, num_iteration=booster.best_iteration)
    return roc_auc_score(y_val, preds)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print("Best trial:", study.best_trial.params)
print("Best AUC:", study.best_value)
```

Note that `lgb.train` expects `lgb.Dataset` objects in `valid_sets`, and recent LightGBM versions handle early stopping and logging through callbacks rather than the removed `early_stopping_rounds`/`verbose_eval` arguments.
## 5.7 Model Evaluation & Validation
| Metric | Regression | Classification |
|--------|------------|----------------|
| RMSE | Root Mean Squared Error | – |
| R² | Coefficient of Determination | – |
| Accuracy | – | Accuracy (top‑1) |
| Precision | – | Positive Predictive Value |
| Recall | – | Recall (Sensitivity) |
| F1 | – | Harmonic mean of Precision & Recall |
| ROC‑AUC | – | Area under ROC curve |
| PR‑AUC | – | Area under Precision‑Recall curve |
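PR‑AUC from the table can be computed with scikit-learn's `average_precision_score`. A sketch on a synthetic imbalanced problem (the 5 % positive rate and score distribution are illustrative assumptions); note how PR‑AUC stays closer to the positive-class prevalence than ROC‑AUC does:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic imbalanced labels: roughly 5% positives
rng = np.random.default_rng(7)
y_true = (rng.random(1000) < 0.05).astype(int)
# Mildly informative scores: positives shifted upward
scores = rng.normal(size=1000) + 1.5 * y_true

auc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)
print(f"ROC-AUC: {auc:.3f}")
print(f"PR-AUC : {ap:.3f}")
```

On heavily imbalanced problems, PR‑AUC is usually the more honest headline metric, since a random classifier scores only the prevalence on PR‑AUC but 0.5 on ROC‑AUC.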
**Visual diagnostics**:
- **Residual plots** for regression.
- **Calibration curves** for probability‑based classifiers.
- **Feature importance** to ensure the model isn’t relying on spurious signals.
```python
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# `preds` holds the validation-set probabilities from the tuned model above
prob_true, prob_pred = calibration_curve(y_val, preds, n_bins=10)
plt.plot(prob_pred, prob_true, marker='o', label='model')
plt.plot([0, 1], [0, 1], linestyle='--', label='perfectly calibrated')
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.title("Calibration Plot")
plt.legend()
plt.show()
```
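A companion sketch for the residual-plot diagnostic listed above, on synthetic regression data (the data and file name are illustrative). A well-specified model leaves residuals centred on zero with no visible curvature or fanning:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs in scripts
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic regression problem with a correct linear specification
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals vs fitted values: look for curvature or heteroscedastic fanning
plt.scatter(model.predict(X), residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.savefig("residuals.png")
```

Curvature in this plot suggests a missing non-linear term; a funnel shape suggests the error variance depends on the prediction, which invalidates plain RMSE comparisons.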
## 5.8 Production‑Ready Checklist
1. **Reproducibility** – Fix `random_state` and store all training artefacts (`model.joblib`, `feature_names.pkl`).
2. **Scalability** – Leverage `joblib`’s `dump`/`load` for serialisation, or expose the model via a REST API using `FastAPI`.
3. **Monitoring** – Create a drift‑detection job that recomputes feature statistics every week.
4. **Explainability** – Generate SHAP plots:
```python
import shap

shap.initjs()
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)
```
> **Business impact** – A single SHAP summary can translate to a *scorecard* that the risk team can incorporate into their decision pipeline.
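Item 3 of the checklist can be sketched as a simple statistical comparison between training-time feature statistics and a fresh batch. The `drift_report` helper, the z-score threshold, and the feature names are all hypothetical; production systems typically use richer tests (e.g. population stability index or KS tests):

```python
import numpy as np
import pandas as pd

def drift_report(reference: pd.DataFrame, current: pd.DataFrame,
                 z_threshold: float = 3.0) -> pd.DataFrame:
    """Flag features whose current mean drifts beyond z_threshold
    standard errors of the reference mean (a simple heuristic check)."""
    ref_mean, ref_std = reference.mean(), reference.std()
    se = ref_std / np.sqrt(len(current))
    z = (current.mean() - ref_mean).abs() / se
    return pd.DataFrame({"z_score": z, "drifted": z > z_threshold})

# Hypothetical example: feature "b" shifts upward in the new weekly batch
rng = np.random.default_rng(3)
reference = pd.DataFrame({"a": rng.normal(0, 1, 5000), "b": rng.normal(0, 1, 5000)})
current = pd.DataFrame({"a": rng.normal(0, 1, 500), "b": rng.normal(0.5, 1, 500)})

report = drift_report(reference, current)
print(report)
```

A weekly scheduled job that runs such a check and alerts on any `drifted` feature is often enough to catch upstream data breakages before model quality degrades.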
## 5.9 Recap – A Quick‑Start Workflow
```python
# 1. Load data → 2. Train/test split → 3. Choose baseline → 4. Cross-validate
# → 5. Tune hyper-parameters → 6. Evaluate → 7. Deploy
```
> **Remember** – The *greatest value* often comes from a small, well‑documented baseline that is easy to explain. Complex models should only be introduced after the baseline fails to meet business SLAs.
---
> **Next step** – In the next chapter we cover how to embed these models into an automated production pipeline and monitor their performance in real‑time.
---
*Prepared by the Data Science Engineering team – © 2023*