Data Science Demystified: A Pragmatic Guide for Business Decision-Makers – Chapter 4
Published 2026-02-23 09:30
# Chapter 4: From Insight to Prediction – Building Supervised Learning Models
Supervised learning turns descriptive insight into predictive power. This chapter translates the exploratory findings from Chapter 3 into a concrete modeling workflow that a data‑science team can deploy, monitor, and iterate on.
## 1. Recap of Business Goals
| Business Question | Decision Impact |
|-------------------|-----------------|
| **Can we predict which customers will churn within 90 days?** | Reduce churn‑related revenue loss by 15 %. |
| **What factors most strongly influence repeat purchases?** | Target marketing spend on high‑impact segments. |
The modeling task is a binary classification: *churn* = 1, *no churn* = 0.
## 2. Feature Engineering Revisited
| Feature Category | Why It Matters | Transformation |
|-------------------|----------------|----------------|
| **Customer‑level** | Captures long‑term behavior | `avg_daily_spend`, `purchase_frequency` |
| **Recency–Frequency–Monetary (RFM)** | Classic churn proxy | `recency_days`, `frequency_last_90`, `monetary_last_90` |
| **Engagement** | Interaction signals | `logins_last_30`, `email_click_rate` |
| **Demographics** | Contextual factors | One‑hot encode `region`, `subscription_type` |
> **Tip:** Keep the feature set versioned. If new data sources become available, retrain a *new* model rather than mutating the existing pipeline.
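The RFM features in the table can be derived from raw transactions with a short pandas aggregation. A minimal sketch, assuming a hypothetical `tx` table with `customer_id`, `order_date`, and `amount` columns (names and values are illustrative, not from the book's dataset):

```python
import pandas as pd

# Hypothetical raw transactions table.
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'order_date': pd.to_datetime([
        '2025-11-01', '2026-01-15', '2025-12-20',
        '2026-01-05', '2026-02-01', '2025-08-10',
    ]),
    'amount': [120.0, 80.0, 40.0, 60.0, 55.0, 200.0],
})

snapshot = pd.Timestamp('2026-02-23')  # scoring date
window = tx[tx['order_date'] >= snapshot - pd.Timedelta(days=90)]

rfm = pd.DataFrame({
    # days since each customer's most recent purchase
    'recency_days': (snapshot - tx.groupby('customer_id')['order_date'].max()).dt.days,
    # number of orders in the trailing 90 days
    'frequency_last_90': window.groupby('customer_id').size(),
    # total spend in the trailing 90 days
    'monetary_last_90': window.groupby('customer_id')['amount'].sum(),
}).fillna({'frequency_last_90': 0, 'monetary_last_90': 0.0})
```

Customers with no purchases inside the window fall out of the 90-day aggregates, hence the `fillna` to record them as zeros rather than missing.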
## 3. Choosing a Base Model
| Algorithm | Typical Use‑Case | Pros | Cons |
|-----------|------------------|------|------|
| **Logistic Regression** | Baseline, interpretable | Fast, transparent | Linear decision boundary |
| **Gradient‑Boosted Trees (XGBoost / LightGBM)** | Strong performance on tabular data | Handles interactions, missing values | Requires careful tuning |
| **Random Forest** | Robust to over‑fitting | Easy to parallelize | Less interpretable |
| **Neural Network** | Complex nonlinear patterns | Flexible | Needs large data & hyper‑opt |
We will showcase a **LightGBM** pipeline because it balances speed, accuracy, and interpretability for business use.
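Whichever algorithm ultimately wins, it should first beat the transparent logistic-regression baseline from the table. A minimal sketch of that baseline, using synthetic data as a stand-in for the real churn features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn feature matrix (~10% positive class).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9],
                           random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# Scaling + logistic regression: the transparent baseline any boosted
# model must beat before its extra complexity is justified.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_tr, y_tr)
auc = roc_auc_score(y_va, baseline.predict_proba(X_va)[:, 1])
print(f'baseline ROC-AUC: {auc:.3f}')
```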
## 4. Pipeline Construction (Python)

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score
import lightgbm as lgb

# 1. Load and merge data
X = pd.read_csv('features.csv')
y = pd.read_csv('labels.csv')['churn']

# 2. Train-test split preserving the churn ratio
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3. LightGBM dataset wrappers
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

# 4. Training parameters
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbosity': -1,
    'seed': 42,
}

# 5. Train with early stopping (callbacks replace the deprecated
#    early_stopping_rounds / verbose_eval arguments in LightGBM >= 4)
model = lgb.train(
    params,
    train_data,
    num_boost_round=5000,
    valid_sets=[train_data, valid_data],
    valid_names=['train', 'valid'],
    callbacks=[lgb.early_stopping(stopping_rounds=200),
               lgb.log_evaluation(period=100)],
)

# 6. Evaluate on the held-out validation set
pred = model.predict(X_valid, num_iteration=model.best_iteration)
print('ROC-AUC:', roc_auc_score(y_valid, pred))
print('F1-Score:', f1_score(y_valid, pred > 0.5))
```
**Key Takeaways**:
* Use **StratifiedKFold** for cross‑validation to maintain churn ratio.
* Set a realistic `num_boost_round` and rely on early stopping.
* Store the *best_iteration* for reproducibility.
## 5. Model Evaluation Metrics
| Metric | What It Measures | Why It Matters for Churn |
|--------|------------------|--------------------------|
| **ROC‑AUC** | Ability to rank positive vs. negative | High values mean we can distinguish churners early. |
| **Precision / Recall** | Balance between false positives / negatives | Business wants to avoid over‑reacting to false churn alerts. |
| **F1‑Score** | Harmonic mean of precision & recall | One‑stop‑shop for overall predictive quality. |
| **Calibration** | Predicted probabilities vs. observed churn | Enables risk‑based pricing or targeted retention offers. |
A practical rule of thumb: *ROC‑AUC > 0.70* is acceptable for many commercial datasets; *> 0.80* is considered strong.
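The calibration row of the table can be checked numerically with scikit-learn's `calibration_curve`, which buckets predictions and compares each bucket's mean predicted probability with its observed positive rate. A sketch on synthetic data:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_va)[:, 1]

# A well-calibrated model tracks the diagonal: predicted 0.30 should
# correspond to roughly 30% observed churn in that bucket.
frac_pos, mean_pred = calibration_curve(y_va, proba, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f'predicted {p:.2f} -> observed {f:.2f}')
```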
## 6. Bias‑Variance Trade‑off in Practice
1. **Under‑fitting** – Too few leaves, too few boosting rounds, or overly aggressive regularisation. Symptoms: low training & validation scores.
2. **Over‑fitting** – Too many leaves, overly deep trees, or too little regularisation. Symptoms: high training score, low validation score.
3. **Regularisation** – Tune `lambda_l1`, `lambda_l2`, `min_data_in_leaf`.
4. **Early Stopping** – Prevents excessive training once validation performance plateaus.
Use **learning curves** to visualize the trade‑off.
## 7. Interpretability and Trust
* **SHAP values** (TreeExplainer) reveal feature impact per prediction.
* **Feature importance** from LightGBM (`model.feature_importance()`).
* **Partial Dependence Plots** to understand non‑linear effects.
Business stakeholders demand *explainability* before deploying a model. Present a **feature‑impact dashboard** that can be shared with marketing and finance teams.
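When the `shap` package is not available, scikit-learn's model-agnostic permutation importance offers a quick first pass at the same question: shuffle one feature at a time and measure the drop in validation score. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=3)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=3)

clf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X_tr, y_tr)

# Large drops in validation AUC mark the features the model actually
# relies on; near-zero drops mark candidates for pruning.
result = permutation_importance(clf, X_va, y_va, scoring='roc_auc',
                                n_repeats=5, random_state=3)
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(f'feature_{i}: {result.importances_mean[i]:.3f}')
```

Unlike LightGBM's built-in split counts, this measure is computed on held-out data, so it reflects generalization rather than training behavior.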
## 8. Ethical and Fairness Considerations
| Issue | Checklist |
|-------|-----------|
| **Data Privacy** | Ensure GDPR‑style consent for customer data. |
| **Bias** | Check disparate impact across demographic groups (e.g., gender, region). |
| **Transparency** | Document the entire pipeline in a reproducible notebook (Git‑tracked). |
| **Model Drift** | Schedule monthly EDA + model re‑training if AUC drops >5 %. |
The goal is to avoid *profiling* customers unfairly while still enabling targeted retention.
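The bias check in the table can start as simply as comparing flag rates across groups. A minimal sketch of the common four-fifths rule on hypothetical scored customers (column names and values are illustrative):

```python
import pandas as pd

# Hypothetical scored customers: flagged = model says "likely churner".
scored = pd.DataFrame({
    'region': ['north', 'north', 'north', 'south', 'south', 'south',
               'south', 'north', 'south', 'north'],
    'flagged': [1, 0, 1, 1, 1, 0, 1, 0, 1, 0],
})

# Selection rate per group, then the ratio of the lowest to the highest;
# under the four-fifths rule a ratio below 0.8 warrants review.
rates = scored.groupby('region')['flagged'].mean()
impact_ratio = rates.min() / rates.max()
print(rates.to_dict(), f'impact ratio = {impact_ratio:.2f}')
```

A low ratio does not prove unfair treatment by itself, but it is the signal that triggers the deeper review the checklist calls for.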
## 9. Deployment Pathways
1. **Batch Prediction** – Generate churn scores nightly, load into the CRM.
2. **Real‑time API** – Wrap the model in a Flask/FastAPI service for instant scoring during web sessions.
3. **Feature Store** – Push engineered features into a central store (e.g., Feast) to keep training and serving aligned.
Document the deployment steps in a **CI/CD pipeline** (GitHub Actions + Docker). Store the final model in a versioned artifact repository (e.g., MLflow, DVC).
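Pathway 1 presupposes a persisted model artifact that the nightly job can reload. A minimal sketch using `pickle` from the standard library (a versioned store such as MLflow or DVC would replace this in production; the file name and customer IDs are illustrative):

```python
import pickle
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Training pipeline: fit and persist the model artifact.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
with open('churn_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Nightly batch job: load the artifact, score customers, hand the
# table to the CRM loader.
with open('churn_model.pkl', 'rb') as f:
    loaded = pickle.load(f)
scores = pd.DataFrame({
    'customer_id': range(len(X)),
    'churn_score': loaded.predict_proba(X)[:, 1],
})
print(scores.head())
```

Keeping the serialization format and library versions pinned in the CI/CD pipeline is what makes the loaded artifact reproduce the training-time predictions exactly.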
## 10. Closing Thought
Building a model is not a one‑off task; it’s a **continuous improvement loop**. By keeping your data pipeline reproducible, your metrics transparent, and your ethical guardrails in place, you convert raw EDA insights into business‑driving predictions that can be trusted across the organization.
---
*Next up: Chapter 5 – Operationalizing Models: From Pipeline to Profit.*