Data Science Unveiled: From Raw Data to Insightful Decisions - Chapter 6
Published 2026-03-06 21:37
# Chapter 6: From Features to Forecasts: Building Predictive Models
In the previous chapters we walked through data acquisition, cleaning, and feature engineering. We left the dataset polished and ready for the next logical step: **translating those engineered features into predictive power**. In this chapter we’ll dive into supervised learning, discuss model selection, hyper‑parameter tuning, and validation strategies. We’ll also keep an eye on reproducibility and ethics as we build, test, and refine our models.
---
## 6.1 Recap of Our Prepared Dataset
```python
import pandas as pd
# Load the engineered dataset
X = pd.read_csv("/data/engineered_features.csv")
y = pd.read_csv("/data/target.csv").squeeze("columns")  # flatten single-column frame to a Series
print(X.shape, y.shape)
# (4500, 12) (4500,)
```
> **Why this matters**: Having a clean, consistent, and well‑documented feature set is the foundation for any reliable predictive model. Each column now represents either a raw, derived, or transformed variable that carries meaningful information about our problem.
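Before modelling, a few quick sanity checks on the loaded frames can save hours of debugging later. A minimal sketch (the random data below is a stand-in for the real CSVs, which are not bundled with the book):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the engineered dataset loaded above
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])
y = pd.Series(rng.integers(0, 2, size=100), name="churn")

# Checks worth running on any freshly loaded feature set
assert X.isna().sum().sum() == 0, "unexpected missing values"
assert len(X) == len(y), "feature/target row mismatch"
assert X.select_dtypes(exclude="number").empty, "non-numeric columns remain"
```

The same three assertions apply verbatim to the real `X` and `y`; if any of them fails, revisit the cleaning steps from the earlier chapters before continuing.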
---
## 6.2 Choosing the Right Algorithm
| Problem type | Typical algorithms | When to start here |
|--------------|--------------------|---------------------|
| **Regression** | Linear Regression, Lasso, ElasticNet, Gradient Boosting (XGBoost, LightGBM) | When the target is continuous and relationships look linear or mildly nonlinear |
| **Classification** | Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machines | When the target is categorical (binary or multiclass) and we want interpretability or high accuracy |
| **Time‑Series Forecasting** | ARIMA, Prophet, LSTM | When data points have a temporal dependency |
For our example—a binary churn prediction task—we’ll start with a **Logistic Regression** baseline, then move to **Gradient Boosting** to capture complex interactions.
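The baseline-then-boost progression can be sketched end to end on synthetic data (the dataset here is generated, not our churn data, so the scores are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification stand-in for churn
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=6, random_state=42
)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
boosted = make_pipeline(GradientBoostingClassifier(random_state=42))

# Compare 5-fold cross-validated ROC-AUC for both models
auc_lr = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
auc_gb = cross_val_score(boosted, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
print(f"Logistic Regression AUC: {auc_lr:.3f}  Gradient Boosting AUC: {auc_gb:.3f}")
```

If the boosted model does not beat the baseline by a meaningful margin, the simpler, more interpretable model is usually the better choice.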
---
## 6.3 Building a Reproducible Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=200))
])
```
> **Tip**: Encapsulating preprocessing and the estimator inside a pipeline guarantees that every step uses the same transformations during training and inference.
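To see the guarantee in action, here is a self-contained round trip (again on generated data): the scaler is fitted only on the training split, and the fitted pipeline applies that same scaling automatically at prediction time.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=200)),
])
pipe.fit(X_train, y_train)             # scaler statistics come from training data only
test_acc = pipe.score(X_test, y_test)  # the fitted scaler transforms the test data
print(f"Held-out accuracy: {test_acc:.3f}")
```

Without the pipeline, it is easy to accidentally fit the scaler on the full dataset, leaking test-set statistics into training.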
---
## 6.4 Hyper‑Parameter Tuning
We’ll use **RandomizedSearchCV** for speed and **Optuna** for more sophisticated Bayesian optimization.
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
param_distributions = {
    'clf__C': uniform(0.01, 10),
    'clf__penalty': ['l1', 'l2'],
    'clf__solver': ['liblinear']  # liblinear supports both l1 and l2; lbfgs supports only l2
}
search = RandomizedSearchCV(pipeline, param_distributions, n_iter=50, scoring='roc_auc', cv=5, verbose=1, random_state=42)
search.fit(X, y)
print("Best params:", search.best_params_)
```
### Optuna Example
```python
import optuna
from optuna.samplers import TPESampler
from sklearn.model_selection import cross_val_score

def objective(trial):
    # suggest_float with log=True replaces the deprecated suggest_loguniform
    C = trial.suggest_float('C', 1e-3, 10, log=True)
    penalty = trial.suggest_categorical('penalty', ['l1', 'l2'])
    clf = LogisticRegression(C=C, penalty=penalty, solver='saga', max_iter=200)
    pipeline = Pipeline([('scaler', StandardScaler()), ('clf', clf)])
    score = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc').mean()
    return score
study = optuna.create_study(direction='maximize', sampler=TPESampler())
study.optimize(objective, n_trials=100)
print('Best trial:', study.best_trial.params)
```
> **Why tune**: Even a modest change in hyper‑parameters can shift the balance between bias and variance, especially for regularized models.
---
## 6.5 Validation Strategies
| Technique | When to Use | Key Considerations |
|-----------|-------------|--------------------|
| **Hold‑out** | Small datasets | Fast, but may be high variance |
| **k‑Fold CV** | Medium to large datasets | Balanced estimate, but assumes i.i.d. |
| **Stratified k‑Fold** | Classification tasks | Maintains class distribution |
| **Time‑Series Split** | Temporal data | Prevents look‑ahead bias |
We’ll stick with **Stratified 5‑Fold CV** for churn prediction.
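The value of stratification is easy to verify directly: on an imbalanced toy dataset, every stratified fold preserves the overall positive rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 negatives, 10 positives (10% positive rate)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 3))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_rates)  # every fold keeps the 10% positive rate
```

A plain `KFold` on the same data could easily produce folds with zero positives, making per-fold metrics like recall undefined.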
---
## 6.6 Avoiding Overfitting
1. **Feature Selection** – Use `SelectKBest` or `RFE`.
2. **Regularization** – L1 (Lasso) can zero‑out irrelevant features.
3. **Early Stopping** – For tree‑based models like XGBoost.
4. **Cross‑Validation** – Estimate generalization on folds the model never trains on, and still reserve a final hold‑out test set for the end.
5. **Ensemble Methods** – Combine predictions from diverse models.
```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
selector = RFE(rf, n_features_to_select=5)  # keep the 5 most predictive of our 12 features
X_selected = selector.fit_transform(X, y)
```
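Early stopping (item 3 above) is available in scikit-learn's own gradient boosting, not just XGBoost. A minimal sketch: the model monitors loss on an internal hold-out and stops adding trees once improvement stalls.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=12, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting rounds
    validation_fraction=0.2,   # internal hold-out used to monitor the loss
    n_iter_no_change=10,       # stop if no improvement for 10 consecutive rounds
    random_state=42,
)
gb.fit(X_demo, y_demo)
print("Rounds actually trained:", gb.n_estimators_)
```

On most datasets the model stops well short of the 500-round ceiling, which both regularizes and speeds up training.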
---
## 6.7 Model Evaluation Metrics
| Metric | Formula | When to prefer |
|--------|---------|----------------|
| **Accuracy** | (TP+TN)/(P+N) | Balanced classes |
| **Precision** | TP/(TP+FP) | When false positives are costly |
| **Recall (Sensitivity)** | TP/(TP+FN) | When false negatives are costly |
| **F1‑Score** | 2*(Precision*Recall)/(Precision+Recall) | Harmonic mean of precision & recall |
| **ROC‑AUC** | Area under ROC curve | Probability outputs, class imbalance |
| **PR‑AUC** | Area under Precision‑Recall curve | Severe class imbalance |
For churn, **ROC‑AUC** and **F1‑Score** are often most insightful.
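The formulas in the table can be checked with a hand-worked example. The confusion-matrix counts below are hypothetical:

```python
# Hypothetical confusion-matrix counts for a churn classifier
tp, fp, fn, tn = 80, 20, 10, 890

precision = tp / (tp + fp)                              # 0.80
recall = tp / (tp + fn)                                 # ~0.889
f1 = 2 * precision * recall / (precision + recall)      # ~0.842
accuracy = (tp + tn) / (tp + fp + fn + tn)              # 0.97
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Note how accuracy looks excellent (0.97) even though one prediction in five flagged as churn is wrong; this is exactly why accuracy alone misleads on imbalanced data.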
---
## 6.8 Putting It All Together
```python
from sklearn.model_selection import cross_validate
from sklearn.metrics import roc_auc_score, f1_score, make_scorer
scoring = {
    'roc_auc': 'roc_auc',
    'f1': 'f1',
    'accuracy': 'accuracy'
}
results = cross_validate(pipeline, X, y, cv=5, scoring=scoring, return_train_score=True)
print('Validation ROC‑AUC:', results['test_roc_auc'].mean())
print('Validation F1:', results['test_f1'].mean())
```
> **Interpretation**: A high ROC‑AUC but low F1 may indicate that the model is good at ranking but not at threshold‑based classification. Adjust the decision threshold or balance classes accordingly.
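Adjusting the decision threshold is straightforward once the model outputs probabilities. A sketch on synthetic imbalanced data: sweep candidate thresholds and keep the one that maximizes F1 on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in (~20% positives)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Sweep thresholds and keep the one with the best F1 on the held-out split
thresholds = np.linspace(0.1, 0.9, 17)
f1s = [f1_score(y_test, proba >= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1s))]
best_f1 = max(f1s)
print(f"best threshold={best_threshold:.2f} F1={best_f1:.3f}")
```

In practice the threshold should be chosen on a validation split, not the final test set, and ideally derived from the business costs of false positives versus false negatives.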
---
## 6.9 Reproducibility Checklist
| Item | Tool | How |
|------|------|-----|
| **Random seeds** | NumPy, Scikit‑learn | `np.random.seed(42)`; also pass `random_state=42` to estimators and CV splitters |
| **Environment** | Conda / Poetry | `environment.yml` / `pyproject.toml` |
| **Code versioning** | Git | Commit every major change |
| **Pipeline serialization** | Joblib / MLflow | `joblib.dump(pipeline, 'model.pkl')` |
| **Data lineage** | DVC | Track data versions |
Reproducibility isn’t optional; it’s a cornerstone of responsible science.
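The serialization row of the checklist is worth making concrete: a fitted pipeline can be dumped and reloaded, and the restored copy should produce identical predictions. A minimal round-trip sketch (using a temporary file rather than a real model registry):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)).fit(X_demo, y_demo)

# Serialize the whole pipeline (scaler + model) and reload it
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

# The restored pipeline gives bit-identical predictions
assert np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
```

Serializing the pipeline rather than the bare estimator is the key point: the scaler travels with the model, so inference code cannot drift out of sync with training-time preprocessing.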
---
## 6.10 Ethical Lens on Predictive Models
1. **Bias Audits** – Evaluate disparate impact across protected attributes.
2. **Transparency** – Use SHAP or LIME to explain predictions.
3. **Privacy** – Ensure compliance with GDPR / CCPA; avoid re‑identification.
4. **Fairness Constraints** – Incorporate group‑fairness objectives in the loss function.
5. **Human‑in‑the‑Loop** – Maintain domain experts in the decision cycle.
> **Case in point**: A churn model that disproportionately flags minority customers must be scrutinized and potentially adjusted.
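A first-pass bias audit (item 1 above) can be as simple as comparing positive-prediction rates across groups. The sketch below uses randomly generated predictions and a hypothetical protected attribute purely to show the computation:

```python
import numpy as np

# Hypothetical model predictions and a hypothetical binary protected attribute
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)  # 0/1 group membership
pred = rng.integers(0, 2, size=1000)   # 0/1 model decisions

# Disparate impact: ratio of positive-prediction rates between groups
rate_a = pred[group == 0].mean()
rate_b = pred[group == 1].mean()
di_ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
print(f"disparate impact ratio: {di_ratio:.3f}")
```

A common rule of thumb flags ratios below 0.8 for review; this is a screening heuristic, not a substitute for a full fairness analysis across all relevant attributes and metrics.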
---
## 6.11 Next Steps
With a robust, tuned, and ethically vetted model in hand, we’re ready to move into the final chapter: deploying the model into production, monitoring its performance, and iteratively improving it. In Chapter 7, we’ll explore containerization, model serving, A/B testing, and governance frameworks.
> **Remember**: The journey from raw data to actionable insight is iterative. Keep revisiting earlier steps when new data arrives or business objectives shift.
---
*End of Chapter 6.*