Data Intelligence: From Foundations to Applications – Chapter 4
Published 2026-02-27 18:39
# Chapter 4: From Insight to Prediction – Building Robust Scikit‑Learn Pipelines and Communicating Uncertainty
## 4.1 Why Predictive Pipelines Matter
After we’ve turned raw numbers into narratives with visualizations, the next logical step is to ask: *what will happen next?* A predictive model is the bridge that turns descriptive stories into prescriptive guidance. But building a model isn’t just about fitting a curve; it’s about crafting a repeatable, auditable process that can be shared with stakeholders, deployed in production, and, crucially, understood in its limits.
> **Key takeaway** – Think of a pipeline as a *recipe* that takes you from raw ingredients (data) to a finished dish (predictions) while keeping the steps documented and reproducible.
## 4.2 The Scikit‑Learn Pipeline Pattern
Scikit‑Learn’s `Pipeline` and `ColumnTransformer` are designed to encapsulate a sequence of transformations followed by a final estimator. They enforce a clean separation of concerns:
| Stage | Purpose |
|-------|---------|
| Pre‑processing | Clean, normalize, and encode raw features |
| Feature Engineering | Create new derived variables |
| Model | Fit an algorithm to the processed data |
| Post‑processing | Interpret, calibrate, or adjust predictions |
### 4.2.1 Example: Predicting Customer Churn
```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report

# 1. Load the dataset (Titanic survival as a stand-in binary target;
#    swap in your own churn table here)
data = fetch_openml("titanic", version=1, as_frame=True)

# 2. Separate target and features, dropping leaky or high-cardinality
#    columns ("boat" and "body" effectively encode the outcome)
X = data.data.drop(columns=["boat", "body", "home.dest", "name", "ticket", "cabin"])
y = data.target.astype(int)

# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 4. Identify numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=["number"]).columns
categorical_cols = X_train.select_dtypes(include=["object", "category"]).columns

# 5. Build the ColumnTransformer (impute first: the raw data has missing values)
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ]
)

# 6. Assemble the full pipeline
clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(n_estimators=200, random_state=42)),
    ]
)

# 7. Hyper-parameter tuning
param_grid = {
    "model__max_depth": [None, 10, 20],
    "model__min_samples_leaf": [1, 2, 4],
}
grid = GridSearchCV(clf, param_grid, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("ROC-AUC on test set:", roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, grid.predict(X_test)))
```
> **Note** – The pipeline ensures that the same transformations applied during training are also applied to any new data, preventing data leakage and making deployment straightforward.
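The deployment half of that claim is easy to demonstrate. Below is a minimal, self-contained sketch (synthetic data and a hypothetical file name `churn_pipeline.joblib`, not from the chapter's dataset) showing that a persisted pipeline carries its fitted transformations along with the model, so new data is transformed exactly as during training:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame standing in for the churn table
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure": rng.normal(24, 6, 200),
    "monthly_spend": rng.normal(50, 10, 200),
})
target = (df["tenure"] < 24).astype(int)

pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(df, target)

# Persist and reload: the scaler's fitted statistics travel with the model
joblib.dump(pipe, "churn_pipeline.joblib")
reloaded = joblib.load("churn_pipeline.joblib")

new_rows = df.iloc[:5]
same = (pipe.predict(new_rows) == reloaded.predict(new_rows)).all()
```

Because the scaler and the classifier are serialized together, the operations team never has to re-implement the preprocessing by hand.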
## 4.3 Visualizing Model Evaluation Inside the Pipeline
### 4.3.1 A Custom Wrapper for Metrics
A neat trick is to wrap the final estimator in a small *meta‑estimator* that records evaluation data on demand. (A plain transformer appended after the model would not work: the last pipeline step must expose `predict` for scoring, and a transformer never sees the fitted model.) This keeps visualisation logic close to the pipeline.
```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import numpy as np

class RocCurveCollector(BaseEstimator, ClassifierMixin):
    """Wrap a classifier and record ROC data when asked."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a clone so the wrapped estimator's hyper-parameters stay untouched
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)

    def collect(self, X, y_true):
        # X must already be pre-processed (see the plotting snippet below)
        y_score = self.predict_proba(X)[:, 1]
        self.fpr_, self.tpr_, _ = roc_curve(y_true, y_score)
        self.auc_ = auc(self.fpr_, self.tpr_)
        return y_score

# Insert into the pipeline as the final (estimator) step
clf_with_roc = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", RocCurveCollector(
            RandomForestClassifier(n_estimators=200, random_state=42)
        )),
    ]
)
```
After fitting, you can plot directly:
```python
clf_with_roc.fit(X_train, y_train)
roc_collector = clf_with_roc.named_steps["model"]

# The collector sits after the pre-processing step, so transform first
X_test_t = clf_with_roc.named_steps["preprocess"].transform(X_test)
roc_collector.collect(X_test_t, y_test)  # populate metrics

plt.plot(roc_collector.fpr_, roc_collector.tpr_, label=f"AUC = {roc_collector.auc_:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve – Pipeline In‑Line")
plt.legend()
plt.show()
```
### 4.3.2 Learning Curves and Calibration
- **Learning Curves** reveal whether your model would benefit from more data or simpler hypotheses.
- **Calibration Plots** show how predicted probabilities map to actual outcomes – crucial for risk‑aware decisions.
```python
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    clf_with_roc, X_train, y_train, cv=5, scoring="roc_auc",
    n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10)
)
train_scores_mean = train_scores.mean(axis=1)
test_scores_mean = test_scores.mean(axis=1)

plt.figure(figsize=(8, 5))
plt.plot(train_sizes, train_scores_mean, label="Training AUC")
plt.plot(train_sizes, test_scores_mean, label="Cross‑Validation AUC")
plt.xlabel("Training Set Size")
plt.ylabel("AUC")
plt.title("Learning Curve")
plt.legend()
plt.show()
```
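The calibration side can be sketched in the same spirit. Here synthetic data and a logistic model stand in for the chapter's churn pipeline, and `sklearn.calibration.calibration_curve` bins predicted probabilities against observed outcomes:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data
X_syn, y_syn = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

# Fraction of actual positives per bin of predicted probability
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="Model")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Perfectly calibrated")
plt.xlabel("Mean Predicted Probability")
plt.ylabel("Fraction of Positives")
plt.title("Calibration Plot")
plt.legend()
plt.show()
```

Points near the diagonal indicate well-calibrated probabilities; systematic deviation suggests the scores should be recalibrated before being used for risk decisions.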
## 4.4 Communicating Model Uncertainty
Predictive models are powerful, but they’re not crystal balls. Stakeholders need to understand *how confident* the model is, and where it may falter.
### 4.4.1 Probability Scores and Confidence Intervals
- For classification, use `predict_proba` to show the model’s confidence.
- For regression, bootstrap residuals or use quantile regression to provide prediction intervals.
```python
# Bootstrap prediction intervals for a regression task
# (assumes regression features/targets X_train, y_train, X_test)
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.utils import resample

n_bootstrap = 200
preds = []
for i in range(n_bootstrap):
    # Refit on a resampled training set; the spread across fits
    # approximates the sampling variability of the predictions
    X_res, y_res = resample(X_train, y_train, random_state=i)
    model = RandomForestRegressor(n_estimators=100, random_state=i)
    model.fit(X_res, y_res)
    preds.append(model.predict(X_test))

preds = np.array(preds)
ci_lower = np.percentile(preds, 2.5, axis=0)   # 95% interval, lower bound
ci_upper = np.percentile(preds, 97.5, axis=0)  # 95% interval, upper bound
```
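The quantile-regression route mentioned above can be sketched as follows, a self-contained example on synthetic data that fits one `GradientBoostingRegressor` per quantile:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real task
X_syn, y_syn = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# One model per quantile gives an approximate 95% prediction interval
lower = GradientBoostingRegressor(loss="quantile", alpha=0.025, random_state=0).fit(X_tr, y_tr)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.975, random_state=0).fit(X_tr, y_tr)

lo, hi = lower.predict(X_te), upper.predict(X_te)
coverage = np.mean((y_te >= lo) & (y_te <= hi))  # fraction of targets inside the interval
```

Unlike the bootstrap, this needs only two extra fits, at the cost of intervals that are only as good as the quantile models themselves.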
### 4.4.2 SHAP for Local and Global Explanations
SHAP (SHapley Additive exPlanations) quantifies each feature’s contribution to a prediction. Visualizing SHAP values helps stakeholders see *why* the model behaves the way it does.
```python
import shap

# TreeExplainer needs the tree model itself (not the whole pipeline)
# and data in the transformed feature space the model was trained on
best_pipe = grid.best_estimator_
X_test_t = best_pipe.named_steps["preprocess"].transform(X_test)
if hasattr(X_test_t, "toarray"):  # densify sparse one-hot output
    X_test_t = X_test_t.toarray()
explainer = shap.TreeExplainer(best_pipe.named_steps["model"])
shap_values = explainer.shap_values(X_test_t)
if isinstance(shap_values, list):  # binary case: one array per class
    shap_values = shap_values[1]
shap.summary_plot(shap_values, X_test_t,
                  feature_names=best_pipe.named_steps["preprocess"].get_feature_names_out(),
                  plot_type="bar")
```
### 4.4.3 Framing Results in Business Language
> **Example** – “The model predicts a 78% chance of churn for this cohort, but the 95% confidence interval ranges from 60% to 90%. In operational terms, you might flag customers with >70% probability for targeted retention offers, but treat those near the lower bound with caution.”
By pairing numeric metrics with clear narrative, you avoid the “black‑box” perception and foster trust.
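As a minimal illustration of that decision rule (the numbers below are hypothetical, not model output), combining point estimates with interval lower bounds takes only a few lines:

```python
import numpy as np

# Hypothetical churn probabilities with bootstrap lower bounds per customer
prob = np.array([0.78, 0.72, 0.55])
ci_lower = np.array([0.60, 0.45, 0.40])

# Flag for retention offers only when both the point estimate and the
# lower bound clear their thresholds
flag = (prob > 0.70) & (ci_lower > 0.50)
```

Encoding the rule explicitly makes the operational policy auditable alongside the model itself.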
## 4.5 End‑to‑End Example: A Reproducible Notebook
Below is a compact notebook outline you can copy‑paste into Jupyter. It demonstrates the full cycle: data import, preprocessing, pipeline construction, evaluation, visualisation, and uncertainty communication.
```python
# 0. Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import shap

# 1. Load data
# ... (same as above) ...

# 2. Build the ROC collector (the meta-estimator from Section 4.3)
class RocCurveCollector(BaseEstimator, ClassifierMixin):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)

    def collect(self, X, y_true):
        y_score = self.predict_proba(X)[:, 1]
        self.fpr_, self.tpr_, _ = roc_curve(y_true, y_score)
        self.auc_ = auc(self.fpr_, self.tpr_)
        return y_score

# 3. Pipeline
clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", RocCurveCollector(
            RandomForestClassifier(n_estimators=200, random_state=42)
        )),
    ]
)

# 4. Hyper-parameter search (note the extra "estimator" level added by the wrapper)
param_grid = {
    "model__estimator__max_depth": [None, 10, 20],
    "model__estimator__min_samples_leaf": [1, 2, 4],
}
grid = GridSearchCV(clf, param_grid, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)

# 5. Evaluation
print("Best params:", grid.best_params_)
print("Test ROC-AUC:", roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, grid.predict(X_test)))

# 6. Visualisations
# ROC curve
best_pipe = grid.best_estimator_
roc_collector = best_pipe.named_steps["model"]
roc_collector.collect(best_pipe.named_steps["preprocess"].transform(X_test), y_test)

plt.figure()
plt.plot(roc_collector.fpr_, roc_collector.tpr_, label=f"AUC = {roc_collector.auc_:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC Curve")
plt.legend()
plt.show()

# Learning curve
train_sizes, train_scores, test_scores = learning_curve(
    grid.best_estimator_, X_train, y_train, cv=5, scoring="roc_auc",
    n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10)
)
plt.figure(figsize=(8, 5))
plt.plot(train_sizes, train_scores.mean(axis=1), label="Training AUC")
plt.plot(train_sizes, test_scores.mean(axis=1), label="CV AUC")
plt.xlabel("Training Size")
plt.ylabel("AUC")
plt.title("Learning Curve")
plt.legend()
plt.show()

# 7. SHAP explanations (unwrap the collector; explain in transformed space)
tree_model = roc_collector.estimator_
X_test_t = best_pipe.named_steps["preprocess"].transform(X_test)
if hasattr(X_test_t, "toarray"):
    X_test_t = X_test_t.toarray()
explainer = shap.TreeExplainer(tree_model)
shap_values = explainer.shap_values(X_test_t)
if isinstance(shap_values, list):  # binary case: one array per class
    shap_values = shap_values[1]
shap.summary_plot(shap_values, X_test_t,
                  feature_names=best_pipe.named_steps["preprocess"].get_feature_names_out(),
                  plot_type="bar")
```
Save this notebook, run it end‑to‑end, and you’ll have a reproducible pipeline that you can hand over to the data‑ops team.
## 4.6 Summary
1. **Scikit‑Learn Pipelines** keep your modeling process tidy and reproducible.
2. **Embedding Visualizations** (ROC, learning curves, SHAP) inside the pipeline keeps evaluation integrated with modeling.
3. **Uncertainty Communication**—through probability scores, confidence intervals, and SHAP plots—transforms raw model output into actionable insights.
4. **End‑to‑End Reproducibility** ensures that what you develop in the lab can be deployed in production without surprises.
With the foundation laid in the previous chapter, you’re now ready to **predict**. In the next chapter, we’ll dive into *time‑series forecasting* and *deep learning* for unstructured data, expanding the predictive horizon even further.