Data Science Demystified: A Pragmatic Guide for Business Decision-Makers – Chapter 4
Published 2026-02-23 09:30
# Chapter 4: From Insight to Prediction – Building Supervised Learning Models
Supervised learning turns descriptive insight into predictive power. This chapter translates the exploratory findings from Chapter 3 into a concrete modeling workflow that a data‑science team can deploy, monitor, and iterate on.
## 1. Recap of Business Goals
| Business Question | Decision Impact |
|-------------------|-----------------|
| **Can we predict which customers will churn within 90 days?** | Reduce churn‑related revenue loss by 15 %. |
| **What factors most strongly influence repeat purchases?** | Target marketing spend on high‑impact segments. |
The modeling task is a binary classification: *churn* = 1, *no churn* = 0.
## 2. Feature Engineering Revisited
| Feature Category | Why It Matters | Transformation |
|-------------------|----------------|----------------|
| **Customer‑level** | Captures long‑term behavior | `avg_daily_spend`, `purchase_frequency` |
| **Recency–Frequency–Monetary (RFM)** | Classic churn proxy | `recency_days`, `frequency_last_90`, `monetary_last_90` |
| **Engagement** | Interaction signals | `logins_last_30`, `email_click_rate` |
| **Demographics** | Contextual factors | One‑hot encode `region`, `subscription_type` |
> **Tip:** Keep the feature set versioned. If new data sources become available, retrain a *new* model rather than mutating the existing pipeline.
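The RFM features in the table can be derived from raw transactions with a short pandas aggregation. A minimal sketch, assuming a hypothetical `tx` table with `customer_id`, `order_date`, and `amount` columns (names and values are illustrative, not from the book's dataset):

```python
import pandas as pd

# Hypothetical raw transactions table.
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'order_date': pd.to_datetime([
        '2025-11-01', '2026-01-15', '2025-12-20',
        '2026-01-05', '2026-02-01', '2025-08-10',
    ]),
    'amount': [120.0, 80.0, 40.0, 60.0, 55.0, 200.0],
})

snapshot = pd.Timestamp('2026-02-23')  # scoring date
window = tx[tx['order_date'] >= snapshot - pd.Timedelta(days=90)]

rfm = pd.DataFrame({
    # days since each customer's most recent purchase
    'recency_days': (snapshot - tx.groupby('customer_id')['order_date'].max()).dt.days,
    # number of orders in the trailing 90 days
    'frequency_last_90': window.groupby('customer_id').size(),
    # total spend in the trailing 90 days
    'monetary_last_90': window.groupby('customer_id')['amount'].sum(),
}).fillna({'frequency_last_90': 0, 'monetary_last_90': 0.0})
```

Customers with no purchases inside the window fall out of the 90-day aggregates, hence the `fillna` to record them as zeros rather than missing.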
## 3. Choosing a Base Model
| Algorithm | Typical Use‑Case | Pros | Cons |
|-----------|------------------|------|------|
| **Logistic Regression** | Baseline, interpretable | Fast, transparent | Linear decision boundary |
| **Gradient‑Boosted Trees (XGBoost / LightGBM)** | Strong performance on tabular data | Handles interactions, missing values | Requires careful tuning |
| **Random Forest** | Robust to over‑fitting | Easy to parallelize | Less interpretable |
| **Neural Network** | Complex nonlinear patterns | Flexible | Needs large data & hyper‑opt |
We will showcase a **LightGBM** pipeline because it balances speed, accuracy, and interpretability for business use.
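Whichever algorithm ultimately wins, it should first beat the transparent logistic-regression baseline from the table. A minimal sketch of that baseline, using synthetic data as a stand-in for the real churn features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn feature matrix (~10% positive class).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9],
                           random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# Scaling + logistic regression: the transparent baseline any boosted
# model must beat before its extra complexity is justified.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_tr, y_tr)
auc = roc_auc_score(y_va, baseline.predict_proba(X_va)[:, 1])
print(f'baseline ROC-AUC: {auc:.3f}')
```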
## 4. Pipeline Construction (Python)

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score
import lightgbm as lgb

# 1. Load and merge data
X = pd.read_csv('features.csv')
y = pd.read_csv('labels.csv')['churn']

# 2. Train-test split preserving the churn ratio
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3. LightGBM dataset wrappers
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

# 4. Training parameters
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbosity': -1,
    'seed': 42,
}

# 5. Train with early stopping (callbacks replace the deprecated
#    early_stopping_rounds / verbose_eval arguments in LightGBM >= 4)
model = lgb.train(
    params,
    train_data,
    num_boost_round=5000,
    valid_sets=[train_data, valid_data],
    valid_names=['train', 'valid'],
    callbacks=[lgb.early_stopping(stopping_rounds=200),
               lgb.log_evaluation(period=100)],
)

# 6. Evaluate on the held-out validation set
pred = model.predict(X_valid, num_iteration=model.best_iteration)
print('ROC-AUC:', roc_auc_score(y_valid, pred))
print('F1-Score:', f1_score(y_valid, pred > 0.5))
```
**Key Takeaways**:
* Use **StratifiedKFold** for cross‑validation to maintain churn ratio.
* Set a realistic `num_boost_round` and rely on early stopping.
* Store the *best_iteration* for reproducibility.
## 5. Model Evaluation Metrics
| Metric | What It Measures | Why It Matters for Churn |
|--------|------------------|--------------------------|
| **ROC‑AUC** | Ability to rank positive vs. negative | High values mean we can distinguish churners early. |
| **Precision / Recall** | Balance between false positives / negatives | Business wants to avoid over‑reacting to false churn alerts. |
| **F1‑Score** | Harmonic mean of precision & recall | One‑stop‑shop for overall predictive quality. |
| **Calibration** | Predicted probabilities vs. observed churn | Enables risk‑based pricing or targeted retention offers. |
A practical rule of thumb: *ROC‑AUC > 0.70* is acceptable for many commercial datasets; *> 0.80* is considered strong.
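The calibration row of the table can be checked numerically with scikit-learn's `calibration_curve`, which buckets predictions and compares each bucket's mean predicted probability with its observed positive rate. A sketch on synthetic data:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_va)[:, 1]

# A well-calibrated model tracks the diagonal: predicted 0.30 should
# correspond to roughly 30% observed churn in that bucket.
frac_pos, mean_pred = calibration_curve(y_va, proba, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f'predicted {p:.2f} -> observed {f:.2f}')
```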
## 6. Bias‑Variance Trade‑off in Practice
1. **Under‑fitting** – Too few leaves, too few boosting rounds, or overly aggressive regularisation. Symptoms: low training & validation scores.
2. **Over‑fitting** – Too many leaves, overly deep trees, or too little regularisation. Symptoms: high training score, low validation score.
3. **Regularisation** – Tune `lambda_l1`, `lambda_l2`, `min_data_in_leaf`.
4. **Early Stopping** – Prevents excessive training once validation performance plateaus.
Use **learning curves** to visualize the trade‑off.
## 7. Interpretability and Trust
* **SHAP values** (TreeExplainer) reveal feature impact per prediction.
* **Feature importance** from LightGBM (`model.feature_importance()`).
* **Partial Dependence Plots** to understand non‑linear effects.
Business stakeholders demand *explainability* before deploying a model. Present a **feature‑impact dashboard** that can be shared with marketing and finance teams.
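When the `shap` package is not available, scikit-learn's model-agnostic permutation importance offers a quick first pass at the same question: shuffle one feature at a time and measure the drop in validation score. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=3)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=3)

clf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X_tr, y_tr)

# Large drops in validation AUC mark the features the model actually
# relies on; near-zero drops mark candidates for pruning.
result = permutation_importance(clf, X_va, y_va, scoring='roc_auc',
                                n_repeats=5, random_state=3)
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(f'feature_{i}: {result.importances_mean[i]:.3f}')
```

Unlike LightGBM's built-in split counts, this measure is computed on held-out data, so it reflects generalization rather than training behavior.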
## 8. Ethical and Fairness Considerations
| Issue | Checklist |
|-------|-----------|
| **Data Privacy** | Ensure GDPR‑style consent for customer data. |
| **Bias** | Check disparate impact across demographic groups (e.g., gender, region). |
| **Transparency** | Document the entire pipeline in a reproducible notebook (Git‑tracked). |
| **Model Drift** | Schedule monthly EDA + model re‑training if AUC drops >5 %. |
The goal is to avoid *profiling* customers unfairly while still enabling targeted retention.
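The bias check in the table can start as simply as comparing flag rates across groups. A minimal sketch of the common four-fifths rule on hypothetical scored customers (column names and values are illustrative):

```python
import pandas as pd

# Hypothetical scored customers: flagged = model says "likely churner".
scored = pd.DataFrame({
    'region': ['north', 'north', 'north', 'south', 'south', 'south',
               'south', 'north', 'south', 'north'],
    'flagged': [1, 0, 1, 1, 1, 0, 1, 0, 1, 0],
})

# Selection rate per group, then the ratio of the lowest to the highest;
# under the four-fifths rule a ratio below 0.8 warrants review.
rates = scored.groupby('region')['flagged'].mean()
impact_ratio = rates.min() / rates.max()
print(rates.to_dict(), f'impact ratio = {impact_ratio:.2f}')
```

A low ratio does not prove unfair treatment by itself, but it is the signal that triggers the deeper review the checklist calls for.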
## 9. Deployment Pathways
1. **Batch Prediction** – Generate churn scores nightly, load into the CRM.
2. **Real‑time API** – Wrap the model in a Flask/FastAPI service for instant scoring during web sessions.
3. **Feature Store** – Push engineered features into a central store (e.g., Feast) to keep training and serving aligned.
Document the deployment steps in a **CI/CD pipeline** (GitHub Actions + Docker). Store the final model in a versioned artifact repository (e.g., MLflow, DVC).
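Pathway 1 presupposes a persisted model artifact that the nightly job can reload. A minimal sketch using `pickle` from the standard library (a versioned store such as MLflow or DVC would replace this in production; the file name and customer IDs are illustrative):

```python
import pickle
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Training pipeline: fit and persist the model artifact.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
with open('churn_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Nightly batch job: load the artifact, score customers, hand the
# table to the CRM loader.
with open('churn_model.pkl', 'rb') as f:
    loaded = pickle.load(f)
scores = pd.DataFrame({
    'customer_id': range(len(X)),
    'churn_score': loaded.predict_proba(X)[:, 1],
})
print(scores.head())
```

Keeping the serialization format and library versions pinned in the CI/CD pipeline is what makes the loaded artifact reproduce the training-time predictions exactly.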
## 10. Closing Thought
Building a model is not a one‑off task; it’s a **continuous improvement loop**. By keeping your data pipeline reproducible, your metrics transparent, and your ethical guardrails in place, you convert raw EDA insights into business‑driving predictions that can be trusted across the organization.
---
*Next up: Chapter 5 – Operationalizing Models: From Pipeline to Profit.*