Data Science for the Modern Analyst: From Data to Insight - Chapter 5
Published 2026-03-04 15:39
# Chapter 5: Machine Learning Techniques
> **Goal of this chapter** – Equip you with a systematic approach to design, train, and evaluate machine‑learning models that can be integrated into real‑world analytics pipelines. We cover the three main paradigms—supervised, unsupervised, and reinforcement learning—alongside the practical tools and best‑practice patterns that make experimentation reproducible, efficient, and trustworthy.
## 1. Overview of Machine‑Learning Paradigms
| Paradigm | Typical Tasks | Representative Algorithms | Key Evaluation Metrics |
|----------|---------------|---------------------------|-----------------------|
| **Supervised** | Classification, Regression | Logistic Regression, Random Forest, Gradient Boosting, Neural Networks | Accuracy, AUC‑ROC, RMSE, MAE |
| **Unsupervised** | Clustering, Dimensionality Reduction, Anomaly Detection | K‑Means, DBSCAN, PCA, t‑SNE, Isolation Forest | Silhouette, Inertia, Reconstruction Error |
| **Reinforcement** | Decision‑making under uncertainty | Q‑Learning, SARSA, DQN, Policy Gradient | Reward, Episode Length |
> **Why the distinction matters** – The choice of paradigm determines the data requirement, the structure of the learning pipeline, the hyperparameters to tune, and the risk of overfitting or data leakage.
## 2. Building a Reproducible Supervised‑Learning Pipeline
Below we walk through a canonical pipeline using **scikit‑learn** and **pandas**. The same pattern can be adapted for other libraries (e.g., PyTorch, XGBoost, LightGBM).
### 2.1 Data Ingestion & Pre‑processing
```python
import pandas as pd
from sklearn.model_selection import train_test_split
# Load tabular data
df = pd.read_csv("data/loan_applications.csv")
# Basic sanity checks
print(df.head())
print(df.describe(include="all"))
```
Key steps:
1. **Feature Selection** – Remove columns that are not predictive (e.g., IDs, timestamps).
2. **Missing‑Value Imputation** – Simple strategies (mean/median) or model‑based (KNNImputer).
3. **Categorical Encoding** – One‑hot for nominal, ordinal encoding for ordinal, or target‑encoding.
4. **Feature Scaling** – Standardization or Min‑Max scaling for algorithms sensitive to scale.
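The model-based option above can be sketched with scikit-learn's `KNNImputer`, which fills each gap from the k most similar complete rows. The toy frame and column names here are illustrative, not taken from the loan data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one missing value; "age" / "income" are made-up columns
toy = pd.DataFrame(
    {"age": [25, 30, 35, 40], "income": [50.0, np.nan, 70.0, 80.0]}
)

# Impute from the 2 nearest rows (distance computed on the observed columns)
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled["income"].isna().sum())  # no missing values remain
```

The missing income is replaced by the mean of its two nearest neighbors by age, so the imputed value respects local structure that a global mean would ignore.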
Using **ColumnTransformer** keeps the pipeline tidy.
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Derive feature lists from the predictors only, so the target
# column ("default") never leaks into the preprocessing step
features = df.drop(columns=["default"])
numeric_features = features.select_dtypes(include=["int64", "float64"]).columns
categorical_features = features.select_dtypes(include=["object", "category"]).columns

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
```
### 2.2 Train / Validation Split
```python
X = df.drop(columns=["default"]) # target column
y = df["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
> **Tip** – For imbalanced problems, use stratified splits to preserve class distribution.
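To see what the tip buys you, split a synthetic 90/10-imbalanced label vector and check that both halves keep the minority rate (the data here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced labels
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print(y_tr.mean(), y_te.mean())  # both close to 0.10, matching the full data
```

Without `stratify`, a small test set can easily end up with too few (or zero) minority examples, which silently distorts any metric computed on it.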
### 2.3 Model Selection & Hyperparameter Tuning
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
clf = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("model", RandomForestClassifier(random_state=42)),
    ]
)

param_grid = {
    "model__n_estimators": [100, 200, 500],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5, 10],
}
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
grid_search.fit(X_train, y_train)
```
**Best practice** – Use **nested cross‑validation** when reporting performance on a held‑out test set.
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
nested_scores = cross_val_score(
    grid_search, X, y, cv=outer_cv, scoring="roc_auc"
)
print("Nested AUC: %.3f ± %.3f" % (nested_scores.mean(), nested_scores.std()))
```
### 2.4 Model Evaluation
```python
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
y_pred = grid_search.predict(X_test)
y_prob = grid_search.predict_proba(X_test)[:, 1]
print("AUC: %.3f" % roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```
### 2.5 Feature Importance & Interpretability
```python
best_pipe = grid_search.best_estimator_
importances = best_pipe.named_steps["model"].feature_importances_

# The one-hot encoder lives in the *categorical* branch of the
# ColumnTransformer, so look it up by name rather than by position
onehot = (
    best_pipe.named_steps["preprocess"]
    .named_transformers_["cat"]
    .named_steps["onehot"]
)
cat_feature_names = onehot.get_feature_names_out(categorical_features)

# Combine numeric and one-hot feature names (order matches the transformer order)
import numpy as np
all_features = np.concatenate([numeric_features, cat_feature_names])
# Plot
import matplotlib.pyplot as plt
importances_sorted = np.argsort(importances)[::-1]
plt.figure(figsize=(12, 6))
plt.bar(range(len(importances)), importances[importances_sorted])
plt.xticks(range(len(importances)), all_features[importances_sorted], rotation=90)
plt.title("Feature Importances")
plt.tight_layout()
plt.show()
```
> **Beyond Random Forests** – For linear models, coefficients directly reveal importance. For tree‑based or black‑box models, use SHAP or LIME.
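SHAP and LIME are third-party packages; a model-agnostic baseline that ships with scikit-learn itself is permutation importance, which scores each feature by how much shuffling it on held-out data degrades a fitted model. A minimal sketch on synthetic data (not the loan data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real tabular dataset
X_syn, y_syn = make_classification(
    n_samples=500, n_features=5, n_informative=2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and record the drop in score
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:+.3f}")
```

Because it works on held-out data, permutation importance reflects what the model actually uses for generalization, unlike impurity-based importances, which can inflate high-cardinality features.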
### 2.6 Deployment‑Ready Model Packaging
```python
import joblib
joblib.dump(grid_search.best_estimator_, "models/loan_default_model.pkl")
```
> **Tip** – Store the model artifact and a versioned **requirements.txt** to ensure consistency.
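A minimal sketch of the load side of that tip, assuming you also store the library version inside the artifact so a mismatch fails fast; the wrapping dict is our own convention, not a joblib requirement:

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small stand-in model
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Persist the fitted model together with the library version that produced it
artifact = {"model": model, "sklearn_version": sklearn.__version__}
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(artifact, path)

loaded = joblib.load(path)
if loaded["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("model was trained under a different scikit-learn version")
print((loaded["model"].predict(X_demo) == model.predict(X_demo)).all())  # True
```

Pickled estimators are not guaranteed to load correctly across scikit-learn versions, so an explicit check like this turns a subtle scoring bug into a loud error.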
## 3. Unsupervised Learning: Patterns & Anomalies
### 3.1 Clustering
```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
# Scale the numeric predictors only (K-Means requires numeric input)
X_scaled = StandardScaler().fit_transform(X.select_dtypes(include="number"))
# Elbow method to choose k
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
plt.plot(K, inertia, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Curve")
plt.show()
```
After selecting *k*, fit and analyze clusters:
```python
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
X["cluster"] = clusters
```
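With cluster labels in hand, the silhouette score from the metrics table in Section 1 gives a quantitative sanity check: values near 1 mean tight, well-separated clusters. A sketch on synthetic blobs (for illustration only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs stand in for real data
X_blobs, _ = make_blobs(
    n_samples=300, centers=[[-5, -5], [0, 0], [5, 5]],
    cluster_std=0.8, random_state=42
)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_blobs)

score = silhouette_score(X_blobs, labels)
print(f"silhouette: {score:.2f}")  # close to 1 for tight, well-separated clusters
```

Unlike inertia, the silhouette score does not decrease monotonically with k, so it can also be swept over candidate values of k as a complement to the elbow curve.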
### 3.2 Dimensionality Reduction
| Algorithm | Use‑case | Example |
|-----------|----------|---------|
| PCA | Reduce noise, speed up downstream | `PCA(n_components=0.95)` |
| t‑SNE | Visualize high‑dim clusters | `TSNE(n_components=2)` |
| UMAP | Preserve local & global structure | `UMAP(n_neighbors=15, n_components=2)` |
```python
from umap import UMAP  # pip install umap-learn

reducer = UMAP(n_components=2, random_state=42)
emb = reducer.fit_transform(X_scaled)
plt.scatter(emb[:, 0], emb[:, 1], c=clusters, cmap="Spectral")
plt.title("UMAP embedding colored by K‑Means cluster")
plt.show()
```
### 3.3 Anomaly Detection
```python
from sklearn.ensemble import IsolationForest
isof = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
isof.fit(X_scaled)
anomaly_scores = isof.decision_function(X_scaled)
anomalies = isof.predict(X_scaled) # -1 anomaly, 1 normal
# Visualize top anomalies
top_anomalies = np.argsort(anomaly_scores)[:10]
print("Top anomalies:", X.iloc[top_anomalies])
```
## 4. Reinforcement Learning (Brief Overview)
Reinforcement learning (RL) differs fundamentally from supervised learning: instead of learning from labeled examples, an agent learns a policy that maximizes cumulative reward in a Markov Decision Process (MDP). Core components:
| Component | Definition |
|-----------|------------|
| **State** | Current observation of the environment |
| **Action** | Decision taken by the agent |
| **Reward** | Feedback signal guiding learning |
| **Policy** | Mapping from states to actions |
| **Value Function** | Expected future reward from a state |
### 4.1 Classic Algorithms
* **Dynamic Programming** – Value Iteration, Policy Iteration (requires full knowledge of transition dynamics).
* **Temporal‑Difference Learning** – SARSA, Q‑Learning (model‑free, off‑policy or on‑policy).
* **Policy Gradient** – REINFORCE, Actor‑Critic.
* **Deep RL** – DQN (Deep Q‑Network), A3C, PPO.
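To make the temporal-difference idea concrete, here is a tabular Q-Learning sketch on a tiny hand-rolled chain MDP: five states, two actions (left/right), reward only at the right end. Everything here is synthetic, not from a real environment:

```python
import numpy as np

n_states, n_actions = 5, 2        # chain 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def env_step(s, a):
    """Deterministic chain: reward 1 and terminate at the rightmost state."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0), s2 == n_states - 1

def act(s):
    # epsilon-greedy; explore on ties so the zero-initialized Q table
    # does not lock the agent into a single arbitrary action
    if rng.random() < epsilon or Q[s, 0] == Q[s, 1]:
        return int(rng.integers(n_actions))
    return int(Q[s].argmax())

for _ in range(500):              # episodes
    s, done = 0, False
    while not done:
        a = act(s)
        s2, r, done = env_step(s, a)
        # Q-Learning update: the off-policy TD target uses max over next actions
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1)[:4])      # greedy policy: right (1) in every non-terminal state
```

Replacing `Q[s2].max()` with the Q-value of the action actually taken next would turn this into SARSA, the on-policy counterpart listed above.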
### 4.2 Practical Pipeline Skeleton (Python + OpenAI Gym)
```python
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
# Simple MLP policy
class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

env = gym.make("CartPole-v1")
policy = PolicyNet(env.observation_space.shape[0], env.action_space.n)
optimizer = optim.Adam(policy.parameters(), lr=1e-3)

# Simple REINFORCE loop (Gym >= 0.26 API: reset() returns (obs, info),
# step() returns (obs, reward, terminated, truncated, info))
for episode in range(500):
    state, _ = env.reset()
    log_probs = []
    rewards = []
    done = False
    while not done:
        state_tensor = torch.tensor(state, dtype=torch.float32)
        probs = policy(state_tensor)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        rewards.append(reward)

    # Compute discounted returns (gamma = 0.99), then normalize them
    G = 0.0
    discounted_returns = []
    for r in reversed(rewards):
        G = r + 0.99 * G
        discounted_returns.insert(0, G)
    discounted_returns = torch.tensor(discounted_returns)
    discounted_returns = (discounted_returns - discounted_returns.mean()) / (
        discounted_returns.std() + 1e-8
    )

    loss = -torch.sum(torch.stack(log_probs) * discounted_returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
> **Note** – RL requires careful reward shaping, exploration‑exploitation balance, and often a simulator or environment like OpenAI Gym.
## 5. Practical Checklist for a Robust ML Pipeline
| Step | What to Check | Why It Matters |
|------|---------------|----------------|
| Data Quality | No leakage, consistent schema, missingness handled | Prevents false insights |
| Feature Engineering | Relevance, multicollinearity, scaling | Improves model performance |
| Train/Val/Test Split | Stratification, temporal ordering | Reflects deployment scenario |
| Hyperparameter Search | Search space, evaluation metric | Finds optimal trade‑off |
| Validation Strategy | Nested CV, cross‑validation folds | Avoids optimistic bias |
| Model Interpretability | Feature importance, SHAP | Builds stakeholder trust |
| Reproducibility | Fixed random seeds, versioned environment | Ensures consistency across runs |
| Deployment Packaging | Joblib/Pickle, Docker | Facilitates production rollout |
| Monitoring | Drift detection, performance metrics | Maintains model health |
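The monitoring row can start very small: compare a live feature's mean against the training baseline and alert when the relative shift crosses a threshold, in the spirit of the 15% rule used in the case study below. The threshold and data here are illustrative:

```python
import numpy as np

def mean_shift_alert(train_values, live_values, threshold=0.15):
    """Return True when the live mean drifts more than `threshold` (relative)."""
    baseline = np.mean(train_values)
    shift = abs(np.mean(live_values) - baseline) / abs(baseline)
    return bool(shift > threshold)

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=1.0, size=1000)
stable = rng.normal(loc=10.1, scale=1.0, size=200)   # small wobble: no alert
drifted = rng.normal(loc=13.0, scale=1.0, size=200)  # ~30% shift: alert

print(mean_shift_alert(train, stable))   # False
print(mean_shift_alert(train, drifted))  # True
```

Production systems usually graduate to distribution-level tests (e.g., population stability index or Kolmogorov-Smirnov), but a mean-shift check is a cheap first tripwire.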
## 6. Case Study Snapshot: Predicting Customer Churn
1. **Problem** – Identify customers likely to cancel subscription.
2. **Data** – 50k rows, 20 features (demographics, usage, support tickets).
3. **Pipeline** –
* Impute missing usage with median.
* One‑hot encode categorical fields.
* Standardize numeric features.
* Train XGBoost with early stopping on a validation set.
4. **Evaluation** – AUC‑ROC = 0.88, Precision@Top‑10 = 0.63.
5. **Interpretability** – SHAP highlights `Monthly Spend` and `Ticket Frequency` as top drivers.
6. **Deployment** – Served via FastAPI in a Docker container on Azure Container Instances.
7. **Monitoring** – Drift alerts when `Ticket Frequency` mean shifts > 15%.
> **Take‑away** – A disciplined pipeline that integrates preprocessing, robust model selection, interpretability, and monitoring turns predictive insights into actionable business assets.
---
> **Next Chapter Preview** – *Model Monitoring & Continuous Learning*: Learn how to keep your model’s performance in check as real‑world data drifts, and how to iterate on models without manual intervention.