
Data Science for the Modern Analyst: From Data to Insight – Chapter 5


Published 2026-03-04 15:39

# Chapter 5: Machine Learning Techniques

> **Goal of this chapter** – Equip you with a systematic approach to designing, training, and evaluating machine‑learning models that can be integrated into real‑world analytics pipelines. We cover the three main paradigms—supervised, unsupervised, and reinforcement learning—alongside the practical tools and best‑practice patterns that make experimentation reproducible, efficient, and trustworthy.

## 1. Overview of Machine‑Learning Paradigms

| Paradigm | Typical Tasks | Representative Algorithms | Key Evaluation Metrics |
|----------|---------------|---------------------------|------------------------|
| **Supervised** | Classification, Regression | Logistic Regression, Random Forest, Gradient Boosting, Neural Networks | Accuracy, AUC‑ROC, RMSE, MAE |
| **Unsupervised** | Clustering, Dimensionality Reduction, Anomaly Detection | K‑Means, DBSCAN, PCA, t‑SNE, Isolation Forest | Silhouette, Inertia, Reconstruction Error |
| **Reinforcement** | Decision‑making under uncertainty | Q‑Learning, SARSA, DQN, Policy Gradient | Reward, Episode Length |

> **Why the distinction matters** – The choice of paradigm determines the data requirements, the structure of the learning pipeline, the hyperparameters to tune, and the risk of overfitting or data leakage.

## 2. Building a Reproducible Supervised‑Learning Pipeline

Below we walk through a canonical pipeline using **scikit‑learn** and **pandas**. The same pattern can be adapted for other libraries (e.g., PyTorch, XGBoost, LightGBM).

### 2.1 Data Ingestion & Pre‑processing

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load tabular data
df = pd.read_csv("data/loan_applications.csv")

# Basic sanity checks
print(df.head())
print(df.describe(include="all"))
```

Key steps:

1. **Feature Selection** – Remove columns that are not predictive (e.g., IDs, timestamps).
2. **Missing‑Value Imputation** – Simple strategies (mean/median) or model‑based (KNNImputer).
3. **Categorical Encoding** – One‑hot encoding for nominal features, ordinal encoding for ordinal features, or target encoding.
4. **Feature Scaling** – Standardization or min‑max scaling for algorithms sensitive to scale.

Using **ColumnTransformer** keeps the pipeline tidy.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Exclude the target column before splitting features by dtype,
# otherwise "default" leaks into the numeric feature list
features = df.drop(columns=["default"])
numeric_features = features.select_dtypes(include=["int64", "float64"]).columns
categorical_features = features.select_dtypes(include=["object", "category"]).columns

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
```

### 2.2 Train / Validation Split

```python
X = df.drop(columns=["default"])  # target column
y = df["default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

> **Tip** – For imbalanced problems, use stratified splits to preserve the class distribution.

### 2.3 Model Selection & Hyperparameter Tuning

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("model", RandomForestClassifier(random_state=42)),
    ]
)

param_grid = {
    "model__n_estimators": [100, 200, 500],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5, 10],
}

grid_search = GridSearchCV(clf, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
grid_search.fit(X_train, y_train)
```

**Best practice** – Use **nested cross‑validation** when reporting performance on a held‑out test set.
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# The GridSearchCV defined above (cv=5) acts as the inner loop;
# the outer loop below yields an unbiased performance estimate
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

nested_scores = cross_val_score(
    grid_search, X, y, cv=outer_cv, scoring="roc_auc"
)
print("Nested AUC: %.3f ± %.3f" % (nested_scores.mean(), nested_scores.std()))
```

### 2.4 Model Evaluation

```python
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

y_pred = grid_search.predict(X_test)
y_prob = grid_search.predict_proba(X_test)[:, 1]

print("AUC: %.3f" % roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

### 2.5 Feature Importance & Interpretability

```python
import numpy as np
import matplotlib.pyplot as plt

importances = grid_search.best_estimator_.named_steps["model"].feature_importances_

# One‑hot feature names come from the *categorical* transformer
# (index 1 in the ColumnTransformer, not index 0, which is numeric)
ohe = (
    grid_search.best_estimator_
    .named_steps["preprocess"]
    .transformers_[1][1]
    .named_steps["onehot"]
)
cat_feature_names = ohe.get_feature_names_out(categorical_features)

# Combine numeric and one‑hot feature names
all_features = np.concatenate([numeric_features, cat_feature_names])

# Plot
importances_sorted = np.argsort(importances)[::-1]
plt.figure(figsize=(12, 6))
plt.bar(range(len(importances)), importances[importances_sorted])
plt.xticks(range(len(importances)), all_features[importances_sorted], rotation=90)
plt.title("Feature Importances")
plt.tight_layout()
plt.show()
```

> **Beyond Random Forests** – For linear models, coefficients directly reveal importance. For tree‑based or black‑box models, use SHAP or LIME.

### 2.6 Deployment‑Ready Model Packaging

```python
import joblib

joblib.dump(grid_search.best_estimator_, "models/loan_default_model.pkl")
```

> **Tip** – Store the model artifact and a versioned **requirements.txt** to ensure consistency.

## 3. Unsupervised Learning: Patterns & Anomalies

### 3.1 Clustering

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale numeric features only – StandardScaler cannot handle raw categoricals
X_num = X[numeric_features]
X_scaled = StandardScaler().fit_transform(X_num)

# Elbow method to choose k
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.plot(K, inertia, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Curve")
plt.show()
```

After selecting *k*, fit and analyze the clusters:

```python
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
X["cluster"] = clusters
```

### 3.2 Dimensionality Reduction

| Algorithm | Use‑case | Example |
|-----------|----------|---------|
| PCA | Reduce noise, speed up downstream tasks | `PCA(n_components=0.95)` |
| t‑SNE | Visualize high‑dimensional clusters | `TSNE(n_components=2)` |
| UMAP | Preserve local & global structure | `UMAP(n_neighbors=15, n_components=2)` |

```python
from umap import UMAP

umap = UMAP(n_components=2, random_state=42)
emb = umap.fit_transform(X_scaled)

plt.scatter(emb[:, 0], emb[:, 1], c=clusters, cmap="Spectral")
plt.title("UMAP embedding colored by K‑Means cluster")
plt.show()
```

### 3.3 Anomaly Detection

```python
from sklearn.ensemble import IsolationForest

isof = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
isof.fit(X_scaled)

anomaly_scores = isof.decision_function(X_scaled)
anomalies = isof.predict(X_scaled)  # -1 anomaly, 1 normal

# Inspect the ten most anomalous rows (lowest scores)
top_anomalies = np.argsort(anomaly_scores)[:10]
print("Top anomalies:\n", X.iloc[top_anomalies])
```

## 4. Reinforcement Learning (Brief Overview)

Reinforcement learning (RL) differs fundamentally from supervised learning: instead of learning from labeled examples, an agent learns a policy that maximizes cumulative reward in a Markov Decision Process (MDP).
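To make this concrete before introducing the formal components, here is a minimal tabular Q‑Learning sketch (Q‑Learning appears in the paradigm table of Section 1). The toy chain environment and hyperparameters below are our own illustration, not an example from any particular library:

```python
import numpy as np

# Toy deterministic chain MDP: states 0..3, actions 0 = left, 1 = right.
# Reaching state 3 yields reward 1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3

def chain_step(state, action):
    """Move left/right along the chain; the episode ends at the goal."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

rng = np.random.default_rng(42)
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # step size, discount, exploration

for _ in range(500):  # episodes
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = chain_step(state, action)
        # Q-learning update: bootstrap from the best next-state action
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) * (not done) - Q[state, action]
        )
        state = next_state

# Greedy policy after training: one action per non-terminal state
print([int(np.argmax(Q[s])) for s in range(GOAL)])
```

After training, the greedy policy moves right from every non‑terminal state, which is exactly the reward‑maximizing behavior on this chain; the same update rule, with a neural network replacing the table, underlies DQN.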
Core components:

| Component | Definition |
|-----------|------------|
| **State** | Current observation of the environment |
| **Action** | Decision taken by the agent |
| **Reward** | Feedback signal guiding learning |
| **Policy** | Mapping from states to actions |
| **Value Function** | Expected future reward from a state |

### 4.1 Classic Algorithms

* **Dynamic Programming** – Value Iteration, Policy Iteration (requires full knowledge of transition dynamics).
* **Temporal‑Difference Learning** – SARSA, Q‑Learning (model‑free, on‑policy and off‑policy respectively).
* **Policy Gradient** – REINFORCE, Actor‑Critic.
* **Deep RL** – DQN (Deep Q‑Network), A3C, PPO.

### 4.2 Practical Pipeline Skeleton (Python + OpenAI Gym)

```python
import gym
import torch
import torch.nn as nn
import torch.optim as optim

# Simple MLP policy
class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

env = gym.make("CartPole-v1")
policy = PolicyNet(env.observation_space.shape[0], env.action_space.n)
optimizer = optim.Adam(policy.parameters(), lr=1e-3)

# Simple REINFORCE loop (classic Gym API, gym < 0.26: reset() returns the
# observation and step() returns a 4-tuple; newer versions differ)
for episode in range(500):
    state = env.reset()
    log_probs = []
    rewards = []
    done = False
    while not done:
        state_tensor = torch.tensor(state, dtype=torch.float32)
        probs = policy(state_tensor)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        state, reward, done, _ = env.step(action.item())
        log_probs.append(log_prob)
        rewards.append(reward)

    # Compute discounted returns, then normalize for stability
    G = 0
    discounted_returns = []
    for r in reversed(rewards):
        G = r + 0.99 * G
        discounted_returns.insert(0, G)

    discounted_returns = torch.tensor(discounted_returns)
    discounted_returns = (discounted_returns - discounted_returns.mean()) / (
        discounted_returns.std() + 1e-8
    )

    loss = -torch.sum(torch.stack(log_probs) * discounted_returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

> **Note** – RL requires careful reward shaping, exploration‑exploitation balance, and often a simulator or environment like OpenAI Gym.

## 5. Practical Checklist for a Robust ML Pipeline

| Step | What to Check | Why It Matters |
|------|---------------|----------------|
| Data Quality | No leakage, consistent schema, missingness handled | Prevents false insights |
| Feature Engineering | Relevance, multicollinearity, scaling | Improves model performance |
| Train/Val/Test Split | Stratification, temporal ordering | Reflects deployment scenario |
| Hyperparameter Search | Search space, evaluation metric | Finds optimal trade‑off |
| Validation Strategy | Nested CV, cross‑validation folds | Avoids optimistic bias |
| Model Interpretability | Feature importance, SHAP | Builds stakeholder trust |
| Reproducibility | Fixed random seeds, versioned environment | Ensures consistency across runs |
| Deployment Packaging | Joblib/Pickle, Docker | Facilitates production rollout |
| Monitoring | Drift detection, performance metrics | Maintains model health |

## 6. Case Study Snapshot: Predicting Customer Churn

1. **Problem** – Identify customers likely to cancel their subscription.
2. **Data** – 50k rows, 20 features (demographics, usage, support tickets).
3. **Pipeline** –
   * Impute missing usage with the median.
   * One‑hot encode categorical fields.
   * Standardize numeric features.
   * Train XGBoost with early stopping on a validation set.
4. **Evaluation** – AUC‑ROC = 0.88, Precision@Top‑10 = 0.63.
5. **Interpretability** – SHAP highlights `Monthly Spend` and `Ticket Frequency` as the top drivers.
6. **Deployment** – Served via FastAPI in a Docker container on Azure Container Instances.
7. **Monitoring** – Drift alerts when the `Ticket Frequency` mean shifts by more than 15%.
> **Take‑away** – A disciplined pipeline that integrates preprocessing, robust model selection, interpretability, and monitoring turns predictive insights into actionable business assets.

---

> **Next Chapter Preview** – *Model Monitoring & Continuous Learning*: Learn how to keep your model’s performance in check as real‑world data drifts, and how to iterate on models without manual intervention.