
Unveiling Insight: Data Science for Strategic Decision‑Making – Chapter 11


Published 2026-03-08 02:19

# Chapter 11: Capstone Case Studies

In the previous chapters we built the technical foundation, from data acquisition to deployment and governance. Here we turn theory into practice with three industry‑specific projects that illustrate the entire data‑science lifecycle in context. Each case study follows the same template:

1. **Problem Definition** – What business question are we answering?
2. **Data Collection & Preparation** – Where does the data come from and how do we clean it?
3. **Exploratory Analysis** – Quick insights that shape feature engineering.
4. **Feature Engineering** – Turning raw fields into predictive signals.
5. **Modeling & Evaluation** – Algorithm choice, hyper‑parameter tuning, metrics.
6. **Deployment & Monitoring** – Packaging, A/B testing, drift detection.
7. **Impact Measurement & Feedback** – How do we quantify success and close the loop?

---

## 11.1 Finance – Credit Risk Modeling

### 1. Problem Definition

A mid‑size bank wants to predict the probability of default (PD) for new loan applicants to optimize approval rates and maintain regulatory capital.

### 2. Data Collection & Preparation

| Source | Description |
|--------|-------------|
| Loan Application DB | 12‑month applicant demographics & financials |
| Credit Bureau API | 3‑year credit history & delinquency records |
| Internal Transaction Logs | Repayment behavior for existing customers |

```python
import pandas as pd

app = pd.read_csv('applications.csv')
credit = pd.read_json('credit_api_response.json')
```

*Missing‑value strategy*: impute numeric features with the median and categorical features with the mode. Outliers flagged by Tukey’s fences are capped.

### 3. Exploratory Analysis

```python
import seaborn as sns

sns.boxplot(x='default', y='annual_income', data=app)
```

*Key insights*: high‑income customers exhibit lower default rates, and there is a strong correlation between past‑delinquency count and current PD.

### 4. Feature Engineering

| Feature | Transformation |
|---------|----------------|
| `age_bucket` | `pd.cut(age, bins=[18,25,35,45,55,65], labels=["18‑24",…])` |
| `credit_score_z` | Standardize credit score |
| `tenure_ratio` | `current_loan_amount / max_loan_amount` |

### 5. Modeling & Evaluation

We compare Logistic Regression, Gradient Boosting (LightGBM), and a calibrated Random Forest.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from lightgbm import LGBMClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
params = {'learning_rate': [0.05, 0.1], 'num_leaves': [31, 63]}
grid = GridSearchCV(LGBMClassifier(), params, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)
print('Best AUC:', grid.best_score_)
```

*Evaluation metrics*:

- **ROC‑AUC**: 0.87
- **Kolmogorov–Smirnov (KS)**: 0.65
- **Calibration curve**: well calibrated under 5‑fold cross‑validation

### 6. Deployment & Monitoring

The model is packaged as a Docker container and served via FastAPI. A *model‑monitor* microservice checks weekly for **concept drift** using the *Population Stability Index (PSI)*.

```yaml
# docker-compose.yml snippet
services:
  credit-model:
    image: bank/credit-pred:latest
    environment:
      - PYTHONPATH=/app
    ports:
      - "8000:8000"
```

### 7. Impact Measurement & Feedback

| Metric | Target | Actual | Notes |
|--------|--------|--------|-------|
| PD Reduction | 2% | 1.8% | Achieved via stricter threshold |
| Approval Rate | 75% | 73% | Slight dip, but risk‑adjusted ROI ↑ by 4% |
| Customer Satisfaction | 4.5/5 | 4.3/5 | Minor decline, addressed with a transparent scoring dashboard |

Feedback loop: every 3 months we retrain on new data, re‑evaluate PSI, and update the credit‑score threshold.

---

## 11.2 Healthcare – Predictive Readmission

### 1. Problem Definition

A hospital aims to reduce 30‑day readmission rates for heart‑failure patients by identifying high‑risk individuals at discharge.

### 2. Data Collection & Preparation

- Electronic Health Records (EHR): vital signs, lab results, comorbidities.
- Pharmacy dispensation logs.
- Social determinants (zip‑code‑level income, insurance type).

Data is anonymized using *k‑anonymity* before analysis.

### 3. Exploratory Analysis

```python
import matplotlib.pyplot as plt

plt.hist(df['length_of_stay'], bins=30)
plt.title('LOS Distribution')
```

Observation: LOS > 10 days correlates with higher readmission.

### 4. Feature Engineering

- **Temporal aggregates**: mean and standard deviation of vitals over the last 3 days.
- **Comorbidity score**: Charlson index.
- **Medication adherence**: proportion of prescribed meds taken.

### 5. Modeling & Evaluation

We deploy a **Gradient Boosting Machine** (XGBoost) with *AUC* and *Sensitivity at 90% Specificity* as primary metrics.

```python
import xgboost as xgb

clf = xgb.XGBClassifier(max_depth=5, n_estimators=300, learning_rate=0.1)
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)[:, 1]
```

Results:

- **AUC**: 0.81
- **Sensitivity@90% Spec**: 0.68

### 6. Deployment & Monitoring

The model is wrapped in a *FHIR‑compatible* API and integrated into the hospital’s clinical decision support system (CDSS). *Drift detection* uses *Adaptive Windowing (ADWIN)* on key clinical features.

### 7. Impact Measurement & Feedback

| KPI | Target | Achieved |
|-----|--------|----------|
| Readmission Rate | ↓5% | ↓4.7% |
| Early Discharge Efficiency | ↑10% | ↑12% |
| Clinician Acceptance | 80% | 85% |

Every month clinicians flag false positives; these are used to fine‑tune the decision threshold via *online learning*.

---

## 11.3 Retail – Demand Forecasting

### 1. Problem Definition

A national retailer wants to optimize inventory for its flagship product line across 1,200 stores.

### 2. Data Collection & Preparation

- POS sales data (daily SKU level)
- Promotional calendar (discounts, in‑store events)
- Weather API (temperature, precipitation)
- Social media sentiment (Twitter)

Data merged on `date` and `store_id`.
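The merge step above can be sketched in pandas. This is a minimal illustration only: the frames, the `units_sold` and `promo_weight` columns, and the values are hypothetical stand‑ins for the real POS and promotional feeds.

```python
import pandas as pd

# Illustrative stand-in frames; real data would come from the POS and promo feeds.
sales = pd.DataFrame({'date': ['2024-01-01', '2024-01-01'],
                      'store_id': [1, 2],
                      'units_sold': [120, 95]})
promos = pd.DataFrame({'date': ['2024-01-01'],
                       'store_id': [1],
                       'promo_weight': [0.25]})

# Left join keeps every sales row; stores with no running promotion get NaN,
# which we fill with 0.0 (no discount).
merged = sales.merge(promos, on=['date', 'store_id'], how='left')
merged['promo_weight'] = merged['promo_weight'].fillna(0.0)
print(merged)
```

A left join is the natural choice here: sales rows are the unit of analysis, and promotional records merely annotate them when present.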
### 3. Exploratory Analysis

```python
from fbprophet import Prophet

model = Prophet()
model.fit(df[['ds', 'y']])
```

Seasonality: strong weekly cycle, holiday peaks.

### 4. Feature Engineering

| Feature | Source |
|---------|--------|
| `is_holiday` | Holiday calendar |
| `promo_weight` | Discount percentage |
| `temp_bin` | Temperature buckets |
| `sentiment_score` | VADER sentiment on brand mentions |

### 5. Modeling & Evaluation

We compare **Prophet**, **ARIMA**, and **LSTM**.

| Model | RMSE | MAPE |
|-------|------|------|
| Prophet | 12.5 | 4.8% |
| ARIMA | 15.3 | 6.1% |
| LSTM | 13.8 | 5.3% |

Prophet is chosen for its interpretability and explicit seasonal components.

### 6. Deployment & Monitoring

Model outputs are pushed to a **Snowflake** data warehouse nightly; an automated Tableau dashboard visualizes forecast vs. actual. Drift check: a **Kolmogorov–Smirnov** test on sales residuals weekly; if KS > 0.1, we trigger a retrain.

### 7. Impact Measurement & Feedback

| Metric | Target | Result |
|--------|--------|--------|
| Stock‑out Rate | <2% | 1.8% |
| Inventory Holding Cost | ↓3% | ↓3.5% |
| Forecast Accuracy | RMSE < 13 | 12.7 |

Store managers provide quarterly feedback on forecast relevance; we adjust the `promo_weight` feature accordingly.

---

## 11.4 Cross‑Domain Lessons

| Lesson | Explanation |
|--------|-------------|
| End‑to‑End Pipelines | All projects followed a reproducible ETL, modeling, and deployment pipeline; versioning data with DVC and code with Git ensures traceability. |
| Governance & Ethics | Patient data used k‑anonymity and credit data complied with GDPR; bias audits were performed on feature importances. |
| Monitoring & Drift | PSI for finance, ADWIN for healthcare, KS for retail – each domain adopted an appropriate drift detector linked to automated retraining. |
| Impact Quantification | Business KPIs (e.g., readmission rate, inventory cost) were paired with statistical confidence intervals to make actionable decisions. |
| Feedback Loops | Regular stakeholder reviews and automated data pipelines close the loop, turning insights into continuous improvement. |

---

## Take‑away

- **Start with a clear business problem** and align all technical decisions to that goal.
- **Build reproducible data pipelines**; document every transformation for auditability.
- **Choose metrics that matter to the business** – not just AUC or RMSE.
- **Deploy with monitoring**; select drift‑detection techniques suited to your domain.
- **Close the loop** through regular stakeholder feedback and automated retraining.
- **Maintain ethical standards** – anonymize sensitive data, audit for bias, and document governance.

By following these practices, you can transform raw data into strategic, sustainable value across any industry. The forest of data science you tended in Chapter 10 will keep bearing fruit, guided by robust metrics, vigilant monitoring, and continuous learning.
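As a closing illustration of the drift‑monitoring theme, the Population Stability Index used in the finance case can be sketched as follows. This is a minimal sketch, not the bank's implementation: the quantile binning scheme, the epsilon smoothing, and the synthetic score distributions are illustrative assumptions.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a new score sample.

    Bin edges come from quantiles of the baseline sample; a small epsilon
    keeps the log term finite when a bin is empty.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip new scores into the baseline range so every value lands in a bin.
    clipped = np.clip(actual, edges[0], edges[-1])
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(clipped, edges)[0] / len(actual)
    eps = 1e-6
    return float(np.sum((a_frac - e_frac) * np.log((a_frac + eps) / (e_frac + eps))))

rng = np.random.default_rng(42)
baseline = rng.normal(600, 50, 10_000)  # last quarter's credit scores
stable = rng.normal(600, 50, 10_000)    # similar population -> small PSI
shifted = rng.normal(570, 60, 10_000)   # drifted population -> larger PSI
print(psi(baseline, stable), psi(baseline, shifted))
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1–0.25 as a moderate shift worth investigating, and above 0.25 as grounds for retraining.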