Data Science Unveiled: From Raw Data to Insightful Decisions - Chapter 6
Published 2026-03-06 21:37
# Chapter 6: From Features to Forecasts: Building Predictive Models
In the previous chapters we walked through data acquisition, cleaning, and feature engineering. We left the dataset polished and ready for the next logical step: **translating those engineered features into predictive power**. In this chapter we’ll dive into supervised learning, discuss model selection, hyper‑parameter tuning, and validation strategies. We’ll also keep an eye on reproducibility and ethics as we build, test, and refine our models.
---
## 6.1 Recap of Our Prepared Dataset
```python
import pandas as pd
# Load the engineered dataset
X = pd.read_csv("/data/engineered_features.csv")
y = pd.read_csv("/data/target.csv").squeeze("columns")  # flatten single-column frame to a Series
print(X.shape, y.shape)
# (4500, 12) (4500,)
```
> **Why this matters**: Having a clean, consistent, and well‑documented feature set is the foundation for any reliable predictive model. Each column now represents either a raw, derived, or transformed variable that carries meaningful information about our problem.
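Before modelling, a few quick sanity checks on the loaded frames can save hours of debugging later. A minimal sketch (the random data below is a stand-in for the real CSVs, which are not bundled with the book):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the engineered dataset loaded above
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])
y = pd.Series(rng.integers(0, 2, size=100), name="churn")

# Checks worth running on any freshly loaded feature set
assert X.isna().sum().sum() == 0, "unexpected missing values"
assert len(X) == len(y), "feature/target row mismatch"
assert X.select_dtypes(exclude="number").empty, "non-numeric columns remain"
```

The same three assertions apply verbatim to the real `X` and `y`; if any of them fails, revisit the cleaning steps from the earlier chapters before continuing.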
---
## 6.2 Choosing the Right Algorithm
| Problem type | Typical algorithms | When to start here |
|--------------|--------------------|---------------------|
| **Regression** | Linear Regression, Lasso, ElasticNet, Gradient Boosting (XGBoost, LightGBM) | When the target is continuous and relationships look linear or mildly nonlinear |
| **Classification** | Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machines | When the target is categorical (binary or multiclass) and we want interpretability or high accuracy |
| **Time‑Series Forecasting** | ARIMA, Prophet, LSTM | When data points have a temporal dependency |
For our example—a binary churn prediction task—we’ll start with a **Logistic Regression** baseline, then move to **Gradient Boosting** to capture complex interactions.
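The baseline-then-boost progression can be sketched end to end on synthetic data (the dataset here is generated, not our churn data, so the scores are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification stand-in for churn
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=6, random_state=42
)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
boosted = make_pipeline(GradientBoostingClassifier(random_state=42))

# Compare 5-fold cross-validated ROC-AUC for both models
auc_lr = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
auc_gb = cross_val_score(boosted, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
print(f"Logistic Regression AUC: {auc_lr:.3f}  Gradient Boosting AUC: {auc_gb:.3f}")
```

If the boosted model does not beat the baseline by a meaningful margin, the simpler, more interpretable model is usually the better choice.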
---
## 6.3 Building a Reproducible Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=200))
])
```
> **Tip**: Encapsulating preprocessing and the estimator inside a pipeline guarantees that every step uses the same transformations during training and inference.
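To see the guarantee in action, here is a self-contained round trip (again on generated data): the scaler is fitted only on the training split, and the fitted pipeline applies that same scaling automatically at prediction time.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=200)),
])
pipe.fit(X_train, y_train)             # scaler statistics come from training data only
test_acc = pipe.score(X_test, y_test)  # the fitted scaler transforms the test data
print(f"Held-out accuracy: {test_acc:.3f}")
```

Without the pipeline, it is easy to accidentally fit the scaler on the full dataset, leaking test-set statistics into training.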
---
## 6.4 Hyper‑Parameter Tuning
We’ll use **RandomizedSearchCV** for speed and **Optuna** for more sophisticated Bayesian optimization.
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
param_distributions = {
    'clf__C': uniform(0.01, 10),
    'clf__penalty': ['l1', 'l2'],
    'clf__solver': ['liblinear']  # liblinear supports both l1 and l2; lbfgs supports only l2
}
search = RandomizedSearchCV(pipeline, param_distributions, n_iter=50, scoring='roc_auc', cv=5, verbose=1, random_state=42)
search.fit(X, y)
print("Best params:", search.best_params_)
```
### Optuna Example
```python
import optuna
from optuna.samplers import TPESampler
from sklearn.model_selection import cross_val_score

def objective(trial):
    # suggest_float with log=True replaces the deprecated suggest_loguniform
    C = trial.suggest_float('C', 1e-3, 10, log=True)
    penalty = trial.suggest_categorical('penalty', ['l1', 'l2'])
    clf = LogisticRegression(C=C, penalty=penalty, solver='saga', max_iter=200)
    pipeline = Pipeline([('scaler', StandardScaler()), ('clf', clf)])
    score = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc').mean()
    return score
study = optuna.create_study(direction='maximize', sampler=TPESampler())
study.optimize(objective, n_trials=100)
print('Best trial:', study.best_trial.params)
```
> **Why tune**: Even a modest change in hyper‑parameters can shift the balance between bias and variance, especially for regularized models.
---
## 6.5 Validation Strategies
| Technique | When to Use | Key Considerations |
|-----------|-------------|--------------------|
| **Hold‑out** | Small datasets | Fast, but may be high variance |
| **k‑Fold CV** | Medium to large datasets | Balanced estimate, but assumes i.i.d. |
| **Stratified k‑Fold** | Classification tasks | Maintains class distribution |
| **Time‑Series Split** | Temporal data | Prevents look‑ahead bias |
We’ll stick with **Stratified 5‑Fold CV** for churn prediction.
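The value of stratification is easy to verify directly: on an imbalanced toy dataset, every stratified fold preserves the overall positive rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 negatives, 10 positives (10% positive rate)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 3))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_rates)  # every fold keeps the 10% positive rate
```

A plain `KFold` on the same data could easily produce folds with zero positives, making per-fold metrics like recall undefined.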
---
## 6.6 Avoiding Overfitting
1. **Feature Selection** – Use `SelectKBest` or `RFE`.
2. **Regularization** – L1 (Lasso) can zero‑out irrelevant features.
3. **Early Stopping** – For tree‑based models like XGBoost.
4. **Cross‑Validation** – Estimate generalization on folds the model never trains on, and still reserve a final hold‑out test set for the end.
5. **Ensemble Methods** – Combine predictions from diverse models.
```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
selector = RFE(rf, n_features_to_select=5)  # keep the 5 most predictive of our 12 features
X_selected = selector.fit_transform(X, y)
```
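Early stopping (item 3 above) is available in scikit-learn's own gradient boosting, not just XGBoost. A minimal sketch: the model monitors loss on an internal hold-out and stops adding trees once improvement stalls.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=12, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting rounds
    validation_fraction=0.2,   # internal hold-out used to monitor the loss
    n_iter_no_change=10,       # stop if no improvement for 10 consecutive rounds
    random_state=42,
)
gb.fit(X_demo, y_demo)
print("Rounds actually trained:", gb.n_estimators_)
```

On most datasets the model stops well short of the 500-round ceiling, which both regularizes and speeds up training.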
---
## 6.7 Model Evaluation Metrics
| Metric | Formula | When to prefer |
|--------|---------|----------------|
| **Accuracy** | (TP+TN)/(P+N) | Balanced classes |
| **Precision** | TP/(TP+FP) | When false positives are costly |
| **Recall (Sensitivity)** | TP/(TP+FN) | When false negatives are costly |
| **F1‑Score** | 2*(Precision*Recall)/(Precision+Recall) | Harmonic mean of precision & recall |
| **ROC‑AUC** | Area under ROC curve | Probability outputs, class imbalance |
| **PR‑AUC** | Area under Precision‑Recall curve | Severe class imbalance |
For churn, **ROC‑AUC** and **F1‑Score** are often most insightful.
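The formulas in the table can be checked with a hand-worked example. The confusion-matrix counts below are hypothetical:

```python
# Hypothetical confusion-matrix counts for a churn classifier
tp, fp, fn, tn = 80, 20, 10, 890

precision = tp / (tp + fp)                              # 0.80
recall = tp / (tp + fn)                                 # ~0.889
f1 = 2 * precision * recall / (precision + recall)      # ~0.842
accuracy = (tp + tn) / (tp + fp + fn + tn)              # 0.97
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Note how accuracy looks excellent (0.97) even though one prediction in five flagged as churn is wrong; this is exactly why accuracy alone misleads on imbalanced data.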
---
## 6.8 Putting It All Together
```python
from sklearn.model_selection import cross_validate
from sklearn.metrics import roc_auc_score, f1_score, make_scorer
scoring = {
    'roc_auc': 'roc_auc',
    'f1': 'f1',
    'accuracy': 'accuracy'
}
results = cross_validate(pipeline, X, y, cv=5, scoring=scoring, return_train_score=True)
print('Validation ROC‑AUC:', results['test_roc_auc'].mean())
print('Validation F1:', results['test_f1'].mean())
```
> **Interpretation**: A high ROC‑AUC but low F1 may indicate that the model is good at ranking but not at threshold‑based classification. Adjust the decision threshold or balance classes accordingly.
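Adjusting the decision threshold is straightforward once the model outputs probabilities. A sketch on synthetic imbalanced data: sweep candidate thresholds and keep the one that maximizes F1 on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in (~20% positives)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Sweep thresholds and keep the one with the best F1 on the held-out split
thresholds = np.linspace(0.1, 0.9, 17)
f1s = [f1_score(y_test, proba >= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1s))]
best_f1 = max(f1s)
print(f"best threshold={best_threshold:.2f} F1={best_f1:.3f}")
```

In practice the threshold should be chosen on a validation split, not the final test set, and ideally derived from the business costs of false positives versus false negatives.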
---
## 6.9 Reproducibility Checklist
| Item | Tool | How |
|------|------|-----|
| **Random seeds** | NumPy, Scikit‑learn | `np.random.seed(42)`; also pass `random_state=42` to estimators and CV splitters |
| **Environment** | Conda / Poetry | `environment.yml` / `pyproject.toml` |
| **Code versioning** | Git | Commit every major change |
| **Pipeline serialization** | Joblib / MLflow | `joblib.dump(pipeline, 'model.pkl')` |
| **Data lineage** | DVC | Track data versions |
Reproducibility isn’t optional; it’s a cornerstone of responsible science.
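The serialization row of the checklist is worth making concrete: a fitted pipeline can be dumped and reloaded, and the restored copy should produce identical predictions. A minimal round-trip sketch (using a temporary file rather than a real model registry):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)).fit(X_demo, y_demo)

# Serialize the whole pipeline (scaler + model) and reload it
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

# The restored pipeline gives bit-identical predictions
assert np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
```

Serializing the pipeline rather than the bare estimator is the key point: the scaler travels with the model, so inference code cannot drift out of sync with training-time preprocessing.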
---
## 6.10 Ethical Lens on Predictive Models
1. **Bias Audits** – Evaluate disparate impact across protected attributes.
2. **Transparency** – Use SHAP or LIME to explain predictions.
3. **Privacy** – Ensure compliance with GDPR / CCPA; avoid re‑identification.
4. **Fairness Constraints** – Incorporate group‑fairness objectives in the loss function.
5. **Human‑in‑the‑Loop** – Maintain domain experts in the decision cycle.
> **Case in point**: A churn model that disproportionately flags minority customers must be scrutinized and potentially adjusted.
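A first-pass bias audit (item 1 above) can be as simple as comparing positive-prediction rates across groups. The sketch below uses randomly generated predictions and a hypothetical protected attribute purely to show the computation:

```python
import numpy as np

# Hypothetical model predictions and a hypothetical binary protected attribute
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)  # 0/1 group membership
pred = rng.integers(0, 2, size=1000)   # 0/1 model decisions

# Disparate impact: ratio of positive-prediction rates between groups
rate_a = pred[group == 0].mean()
rate_b = pred[group == 1].mean()
di_ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
print(f"disparate impact ratio: {di_ratio:.3f}")
```

A common rule of thumb flags ratios below 0.8 for review; this is a screening heuristic, not a substitute for a full fairness analysis across all relevant attributes and metrics.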
---
## 6.11 Next Steps
With a robust, tuned, and ethically vetted model in hand, we’re ready to move into the final chapter: deploying the model into production, monitoring its performance, and iteratively improving it. In Chapter 7, we’ll explore containerization, model serving, A/B testing, and governance frameworks.
> **Remember**: The journey from raw data to actionable insight is iterative. Keep revisiting earlier steps when new data arrives or business objectives shift.
---
*End of Chapter 6.*