
Data Science Unlocked: A Practical Guide for Modern Analysts – Chapter 5


Published 2026-02-23 16:39

# Chapter 5: Supervised Learning in Practice

> **Goal** – Translate the patterns discovered during EDA into predictive models that provide actionable business insights. This chapter walks through regression, classification, validation, and hyper‑parameter tuning using the most popular Python libraries.

## 5.1 From Insight to Prediction

- **Define the objective**: *What* you are trying to predict and *why* it matters.
- **Select the target**: Continuous for regression, categorical for classification.
- **Align with stakeholders**: Ensure the model’s output can be interpreted and acted upon.

| Task | Typical Target | Common Use‑Cases |
|------|----------------|-------------------|
| Regression | `float` | House price, sales forecast, risk score |
| Classification | `int`/`str` | Customer churn, fraud detection, disease diagnosis |

## 5.2 Project Setup & Data Pipeline

```python
# Imports – keep dependencies minimal for reproducibility
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load pre‑cleaned data from the previous chapter
df = pd.read_csv('data/cleaned_dataset.csv')

# Basic sanity checks – missing-value counts and summary statistics
print(df.isna().sum())
print(df.describe().T)
```

**Tip**: Store the split indices (`train_idx`, `test_idx`) in a JSON file so that the same split can be reused for experiments and production.
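A minimal sketch of that tip on a toy DataFrame (the filename `split_indices.json` is illustrative, not prescribed by the chapter):

```python
import json
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the chapter's cleaned dataset
df = pd.DataFrame({"feature": range(10), "price": range(10)})
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Persist the row indices of the split as JSON
with open("split_indices.json", "w") as f:
    json.dump({"train_idx": train_df.index.tolist(),
               "test_idx": test_df.index.tolist()}, f)

# Later (or in production): reload and reproduce the identical split
with open("split_indices.json") as f:
    idx = json.load(f)
train_again = df.loc[idx["train_idx"]]
assert train_again.equals(train_df)  # exact same rows, same order
```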
## 5.3 Regression Fundamentals

### 5.3.1 Linear Regression

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = df.drop(columns=["price"])
y = df["price"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_val)
print("RMSE: {:.2f}".format(np.sqrt(mean_squared_error(y_val, pred))))
print("R²: {:.2f}".format(r2_score(y_val, pred)))
```

> **Interpretation** – The coefficient of each feature tells you how a one‑unit increase in that feature shifts the predicted target, assuming all other features are held constant.

### 5.3.2 Regularized Regression (Ridge & Lasso)

```python
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)
```

- **Ridge** shrinks coefficients but keeps all features.
- **Lasso** performs feature selection by driving some coefficients to zero.

## 5.4 Classification Essentials

### 5.4.1 Logistic Regression (Baseline)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

X = df.drop(columns=["churn"])
y = df["churn"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
pred_proba = logreg.predict_proba(X_val)[:, 1]
print("ROC‑AUC: {:.3f}".format(roc_auc_score(y_val, pred_proba)))
# classification_report needs hard labels, so threshold the probabilities
print(classification_report(y_val, (pred_proba >= 0.5).astype(int)))
```

### 5.4.2 Tree‑Based Ensembles – XGBoost & LightGBM

```python
import xgboost as xgb
import lightgbm as lgb

# XGBoost
xgb_model = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4,
                              subsample=0.8, colsample_bytree=0.8)
xgb_model.fit(X_train, y_train)
print("XGBoost ROC‑AUC: {:.3f}".format(
    roc_auc_score(y_val, xgb_model.predict_proba(X_val)[:, 1])))

# LightGBM
lgb_model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05, max_depth=4,
                               subsample=0.8, colsample_bytree=0.8)
lgb_model.fit(X_train, y_train)
print("LightGBM ROC‑AUC: {:.3f}".format(
    roc_auc_score(y_val, lgb_model.predict_proba(X_val)[:, 1])))
```

> **Why ensembles?** They capture non‑linear interactions and often outperform single‑tree or linear baselines on tabular data.

## 5.5 Cross‑Validation Strategies

| Scenario | Recommended CV | Key Parameters |
|----------|----------------|----------------|
| Small dataset | KFold (k=5–10) | `shuffle=True`, `random_state` |
| Classification with imbalance | StratifiedKFold | `n_splits`, `shuffle` |
| Time‑series | TimeSeriesSplit | `n_splits` |

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(lgb_model, X, y, cv=skf, scoring='roc_auc')
print("CV ROC‑AUC: {:.3f} ± {:.3f}".format(cv_scores.mean(), cv_scores.std()))
```

## 5.6 Hyper‑Parameter Tuning

### 5.6.1 Grid Search (exhaustive)

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [200, 400],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5, 7]
}
grid = GridSearchCV(lgb_model, param_grid, cv=skf, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best ROC‑AUC:", grid.best_score_)
```

### 5.6.2 Randomized Search (sampling)

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

param_dist = {
    'n_estimators': [100, 200, 400, 600],
    'learning_rate': uniform(0.01, 0.2),
    'max_depth': [3, 5, 7, 9]
}
rand = RandomizedSearchCV(lgb_model, param_dist, n_iter=50, cv=skf,
                          scoring='roc_auc', n_jobs=-1, random_state=42)
rand.fit(X_train, y_train)
print("Best params:", rand.best_params_)
print("Best ROC‑AUC:", rand.best_score_)
```

### 5.6.3 Bayesian Optimization (Optuna)

```python
import optuna
from optuna.integration import LightGBMPruningCallback

def objective(trial):
    param = {
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 9),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'objective': 'binary',
        'metric': 'auc',
        'verbosity': -1
    }
    n_estimators = trial.suggest_int('n_estimators', 200, 600)
    # lgb.train expects lgb.Dataset objects for validation, not (X, y) tuples
    dtrain = lgb.Dataset(X_train, label=y_train)
    dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)
    booster = lgb.train(param, dtrain,
                        num_boost_round=n_estimators,
                        valid_sets=[dval],
                        callbacks=[lgb.early_stopping(50, verbose=False),
                                   LightGBMPruningCallback(trial, "auc")])
    preds = booster.predict(X_val, num_iteration=booster.best_iteration)
    return roc_auc_score(y_val, preds)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print("Best trial:", study.best_trial.params)
print("Best AUC:", study.best_value)
```

## 5.7 Model Evaluation & Validation

| Metric | Regression | Classification |
|--------|------------|----------------|
| RMSE | Root Mean Squared Error | – |
| R² | Coefficient of Determination | – |
| Accuracy | – | Accuracy (top‑1) |
| Precision | – | Precision (Positive Predictive Value) |
| Recall | – | Recall (Sensitivity) |
| F1 | – | Harmonic mean of Precision & Recall |
| ROC‑AUC | – | Area under ROC curve |
| PR‑AUC | – | Area under Precision‑Recall curve |

**Visual diagnostics**:

- **Residual plots** for regression.
- **Calibration curves** for probability‑based classifiers.
- **Feature importance** to ensure the model isn’t relying on spurious signals.

```python
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# calibration_curve returns (prob_true, prob_pred): observed fraction of
# positives per bin, and the mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_val, preds, n_bins=10)
plt.plot(prob_pred, prob_true, marker='o')
plt.plot([0, 1], [0, 1], linestyle='--')  # perfectly calibrated reference
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.title("Calibration Plot")
plt.show()
```

## 5.8 Production‑Ready Checklist

1. **Reproducibility** – Fix `random_state` and store all training artefacts (`model.joblib`, `feature_names.pkl`).
2. **Scalability** – Leverage `joblib`’s `dump`/`load` for serialisation, or expose the model via a REST API using `FastAPI`.
3. **Monitoring** – Create a drift‑detection job that recomputes feature statistics every week.
4. **Explainability** – Generate SHAP plots:

```python
import shap

shap.initjs()
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)
```

> **Business impact** – A single SHAP summary can translate to a *scorecard* that the risk team can incorporate into their decision pipeline.

## 5.9 Recap – A Quick‑Start Workflow

1. Load data → 2. Train/test split → 3. Choose baseline → 4. CV → 5. Tune hyper‑parameters → 6. Evaluate → 7. Deploy

> **Remember** – The *greatest value* often comes from a small, well‑documented baseline that is easy to explain. Complex models should only be introduced after the baseline fails to meet business SLAs.

---

> **Next step** – In the next chapter we cover how to embed these models into an automated production pipeline and monitor their performance in real‑time.

---

*Prepared by the Data Science Engineering team – © 2023*
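As a closing illustration, the seven-step quick-start workflow from the recap can be sketched end to end on synthetic data. This is a minimal sketch using scikit-learn only; `make_classification` stands in for the chapter's churn dataset, and logistic regression plays the role of the baseline.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

# 1. Load data (synthetic stand-in for a real churn dataset)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 2. Train/test split, stratified to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 3.-4. Baseline model scored with stratified 5-fold CV
model = LogisticRegression(max_iter=500)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(model, X_train, y_train, cv=skf, scoring="roc_auc")

# 5. (Tuning skipped here: a baseline often needs none.)
# 6. Fit on the full training set, evaluate once on held-out data
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# 7. Persist the artefact for deployment
joblib.dump(model, "baseline_model.joblib")
print("CV ROC-AUC: {:.3f}, test ROC-AUC: {:.3f}".format(cv_auc.mean(), test_auc))
```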