
Analytics Alchemy: Turning Data into Strategic Advantage – Chapter 9


Published 2026-03-02 17:04

# Chapter 9: Analytics in Action: Case Studies

In this chapter we walk through four detailed, industry-specific projects that demonstrate the full analytics lifecycle—from problem definition through deployment—while weaving in the ethical, governance, and operational considerations discussed in Chapter 8. Each case study is structured around the same five-step framework:

1. **Business Objective & Question Formulation**
2. **Data Acquisition & Pipeline Construction**
3. **Exploratory Analysis & Feature Engineering**
4. **Model Development, Evaluation & Explainability**
5. **Deployment, Monitoring & Ethical Governance**

We will use real-world data sources where possible, and include reproducible Python code snippets and pseudocode to illustrate key techniques.

---

## 1. Retail – Predicting Customer Churn for a Subscription-Based Streaming Service

| **Phase** | **Key Activities** | **Tools/Tech** |
|-----------|--------------------|----------------|
| Problem Definition | Identify which subscribers are at risk of cancelling within 30 days | Stakeholder interviews, churn definition document |
| Data | Historical subscription logs, usage metrics, support tickets | PostgreSQL, Apache Airflow |
| Pipeline | Ingest, cleanse, compute engagement scores, aggregate into a monthly snapshot | dbt, Pandas, Spark |
| Modeling | Logistic regression, XGBoost, SHAP for feature importance | scikit-learn, xgboost, shap |
| Deployment | API via FastAPI, Slack alerts for high-risk users | Docker, Kubernetes, Prometheus |
| Governance | Bias audit for demographic features, privacy-preserving hashing | Differential Privacy (PyDP), GDPR checklist |
| Outcome | 12% reduction in churn within 3 months | KPI dashboard |

### 1.1 Problem Definition

The streaming platform faces a high churn rate among newly acquired subscribers. The business objective is to reduce churn by 10% over the next fiscal year.
We frame the analytical question:

> *Which subscribers are most likely to cancel in the next 30 days, and what actions can be taken to retain them?*

### 1.2 Data Acquisition & Pipeline

```python
from pyspark.sql.functions import from_json, col

# Ingest raw events from a Kafka topic into a raw Delta table
# (`spark` and the event `schema` are defined earlier in the pipeline)
(spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "subscription_events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json_str")
    .select(from_json(col("json_str"), schema).alias("data"))
    .select("data.*")
    .write
    .format("delta")
    .mode("append")
    .save("/mnt/delta/raw/subscription_events"))
```

Using **dbt** we transform the raw table into a curated view:

```sql
-- models/subscriber_profile.sql
WITH base AS (
    SELECT
        subscriber_id,
        MIN(start_date)       AS first_subscription_date,
        MAX(end_date)         AS last_subscription_date,
        COUNT(*)              AS total_sessions,
        AVG(session_duration) AS avg_session
    FROM {{ ref('raw_subscription_events') }}
    GROUP BY subscriber_id
)
SELECT * FROM base
```

### 1.3 Exploratory Analysis & Feature Engineering

```python
import pandas as pd
import seaborn as sns

# Load the curated dbt model, materialized as the subscriber_profile table
# (`engine` is a SQLAlchemy connection to the warehouse, defined elsewhere)
df = pd.read_sql("SELECT * FROM subscriber_profile", con=engine)

# Label churn: no active subscription within the last 30 days
df['churned'] = df['last_subscription_date'] < (pd.Timestamp('today') - pd.Timedelta(days=30))

# Visualize key drivers
sns.boxplot(x='churned', y='total_sessions', data=df)
```

Feature engineering steps include:

- **Engagement score**: weighted sum of daily active minutes.
- **Recency, Frequency, Monetary (RFM)** values.
- **Support ticket count** in the last 60 days.
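The feature engineering steps above can be sketched in pandas. The `daily_activity` frame, its column names, and the engagement weights are illustrative assumptions, not part of the production pipeline:

```python
import pandas as pd

# Hypothetical per-day activity log: one row per subscriber per active day
daily_activity = pd.DataFrame({
    'subscriber_id': [1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2026-02-01', '2026-02-15',
                            '2026-02-10', '2026-02-20', '2026-02-28']),
    'active_minutes': [120, 30, 45, 60, 90],
    'spend': [9.99, 0.0, 9.99, 0.0, 4.99],
})

snapshot_date = pd.Timestamp('2026-03-01')

# RFM values plus total activity per subscriber
rfm = daily_activity.groupby('subscriber_id').agg(
    recency_days=('date', lambda d: (snapshot_date - d.max()).days),
    frequency=('date', 'count'),
    monetary=('spend', 'sum'),
    total_minutes=('active_minutes', 'sum'),
)

# Engagement score: illustrative weights, to be tuned with business input
rfm['engagement_score'] = 0.6 * rfm['total_minutes'] - 0.4 * rfm['recency_days']
```

These derived columns would feed the monthly snapshot that the churn model trains on.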
### 1.4 Model Development & Explainability

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
import shap

# Keep only numeric predictors: drop identifiers, the label, and raw dates
feature_cols = [c for c in df.columns
                if c not in ('subscriber_id', 'churned',
                             'first_subscription_date', 'last_subscription_date')]
X = df[feature_cols]
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=200)
model.fit(X_train, y_train)

# SHAP explanations
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_cols)
```

Performance metrics: ROC-AUC = 0.87, precision@k (k = 200) = 0.68.

### 1.5 Deployment & Governance

- The model is exported as an **ONNX** bundle and served via FastAPI.
- A **Slack bot** triggers when a subscriber's predicted churn probability exceeds 0.7.
- **Bias audit**: model predictions are stratified by gender and region; no statistically significant disparity after re-balancing.
- **Privacy**: subscriber IDs are hashed with SHA-256 before being stored in the API logs.

---

## 2. Finance – Credit Risk Assessment for a Peer-to-Peer Lending Platform

| **Phase** | **Key Activities** | **Tools/Tech** |
|-----------|--------------------|----------------|
| Problem Definition | Predict probability of default within 12 months | Credit scoring policy |
| Data | Loan origination records, payment history, external credit bureaus | Snowflake, Fivetran |
| Pipeline | Data lake ingestion, feature store, versioned models | DataRobot, Hopsworks |
| Modeling | Gradient Boosting, GLM, fairness constraints | scikit-learn, Fairlearn |
| Deployment | Batch scoring via Airflow, real-time API | FastAPI, Docker |
| Governance | Explainable AI, audit trails, regulatory reporting | SHAP, Evidently AI |
| Outcome | 15% reduction in default rate, 2% higher NPV | Risk-adjusted return |

### 2.1 Problem Definition

The platform seeks to refine its credit decisioning to balance risk and growth. The analytical question:

> *Given borrower attributes, can we estimate a probability of default (PD) that aligns with the institution's risk appetite and regulatory constraints?*

### 2.2 Data Acquisition & Pipeline

Data from multiple sources is integrated into Snowflake via Fivetran. A **feature store** (Hopsworks) keeps versioned features:

```sql
-- Feature: debt_to_income_ratio
SELECT
    borrower_id,
    SUM(debt_amount) / SUM(annual_income) AS dti
FROM loan_payments
GROUP BY borrower_id;
```

The feature store schema includes:

- **Static**: borrower demographics, credit bureau scores.
- **Dynamic**: recent payment behavior, account balances.

### 2.3 Exploratory Analysis & Feature Engineering

We examine a **missingness matrix** to decide on imputation strategies and build **interaction terms** between employment stability and DTI. We also compute **credit utilization ratios** per credit line.
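These three steps can be sketched with pandas. The `borrowers` frame and its column names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical borrower snapshot with some missing values
borrowers = pd.DataFrame({
    'dti': [0.35, np.nan, 0.52, 0.18],
    'employment_years': [4.0, 1.5, np.nan, 10.0],
    'credit_limit': [10000, 5000, 8000, 20000],
    'balance': [2500, 4800, 1000, 3000],
})

# Missingness matrix: one boolean per cell, summarized as a per-column rate
missingness = borrowers.isna()
missing_rate = missingness.mean()

# Median-impute, then build the employment-stability x DTI interaction term
filled = borrowers.fillna(borrowers.median(numeric_only=True))
filled['emp_x_dti'] = filled['employment_years'] * filled['dti']

# Credit utilization per credit line
filled['utilization'] = filled['balance'] / filled['credit_limit']
```

In practice the missingness rates would drive the choice of imputation (median, model-based, or a "missing" indicator) per feature.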
### 2.4 Model Development & Fairness

```python
from sklearn.ensemble import GradientBoostingClassifier
from fairlearn.metrics import demographic_parity_difference

# Exclude the label and the (non-numeric) sensitive attribute from the features
X = df.drop(columns=['default', 'age_group'])
y = df['default']

model = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05)
model.fit(X, y)

# Fairness metric: demographic parity across age groups
dp_diff = demographic_parity_difference(
    y_true=y,
    y_pred=model.predict(X),
    sensitive_features=df['age_group'],
)
print(f"Demographic parity difference: {dp_diff:.3f}")
```

If the disparity exceeds 0.05, we re-weight the training set or add a fairness constraint via `fairlearn`. The final model achieves AUC = 0.92 and a parity difference of 0.02.

### 2.5 Deployment & Governance

- Scores are batched nightly using Airflow and updated in the risk score table.
- A REST endpoint (FastAPI) provides real-time PD for new loan requests.
- **Explainability**: SHAP force plots accompany each score in the risk dashboard.
- **Audit trail**: every score, model version, and feature extraction timestamp is logged in a tamper-evident ledger.
- **Regulatory reporting**: Evidently AI generates KPI reports for Basel III compliance.

---

## 3. Healthcare – Early Detection of Sepsis in ICU Patients

| **Phase** | **Key Activities** | **Tools/Tech** |
|-----------|--------------------|----------------|
| Problem Definition | Predict onset of sepsis 6 hours before clinical diagnosis | Clinical workflow analysis |
| Data | Electronic Health Records (EHR), vital signs, lab results | RedCap, Mirth Connect |
| Pipeline | Real-time ingestion, feature extraction, model scoring | FHIR API, TensorFlow Extended (TFX) |
| Modeling | Recurrent neural networks (LSTM), temporal feature engineering | Keras, scikit-time |
| Deployment | EHR dashboard alerts, nurse notification system | FHIR, HL7 v2, PagerDuty |
| Governance | Patient privacy (HIPAA), explainability | LIME, SHAP, de-identification pipeline |
| Outcome | 30% reduction in time to treatment, improved survival | Clinical outcome metrics |

### 3.1 Problem Definition

Sepsis is a life-threatening condition that requires rapid intervention. The analytical goal:

> *Can we build a predictive model that flags patients likely to develop sepsis 6 hours before clinical diagnosis, allowing clinicians to intervene earlier?*

### 3.2 Data Acquisition & Pipeline

We connect to the hospital's FHIR server to pull near-real-time vitals and labs:

```python
import time

from fhirclient import client

smart = client.FHIRClient(settings={
    'app_id': 'myapp',
    'api_base': 'https://ehr.example.com/fhir',
})

# Poll vital-sign Observations every minute and persist them
while True:
    obs = smart.server.request_json(
        f"Observation?patient={patient_id}&category=vital-signs"
    )
    write_to_delta(obs)  # user-defined persistence into Delta Lake
    time.sleep(60)
```

The **TFX** pipeline includes data validation, feature transformation (time-window aggregates), and model training.

### 3.3 Exploratory Analysis & Temporal Features

We compute rolling statistics over 6-hour windows:

- Mean, standard deviation, and trend slope for heart rate, temperature, and MAP.
- Interaction terms between lactate and white blood cell count.
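The rolling-window features above can be sketched in pandas. The `vitals` frame is a hypothetical hourly series for one patient, and the slope fit assumes evenly spaced observations:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly heart-rate readings for one patient, indexed by timestamp
idx = pd.date_range('2026-01-01 00:00', periods=12, freq='h')
vitals = pd.DataFrame(
    {'heart_rate': [80, 82, 85, 90, 88, 95, 100, 103, 110, 108, 115, 120]},
    index=idx,
)

# Rolling mean and standard deviation over a 6-hour window
vitals['hr_mean_6h'] = vitals['heart_rate'].rolling('6h').mean()
vitals['hr_std_6h'] = vitals['heart_rate'].rolling('6h').std()

def slope(window: pd.Series) -> float:
    """Least-squares trend slope over the window (assumes even spacing)."""
    if len(window) < 2:
        return np.nan
    x = np.arange(len(window))
    return np.polyfit(x, window.values, 1)[0]

# Trend slope over the same 6-hour window
vitals['hr_slope_6h'] = vitals['heart_rate'].rolling('6h').apply(slope, raw=False)
```

The same pattern extends to temperature and MAP, and a lactate-times-WBC column gives the interaction term.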
Visualization of sepsis onset timelines reveals a sharp spike in lactate 4 hours before diagnosis.

### 3.4 Model Development & Explainability

```python
import shap
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# window_size, feature_dim, train_ds, val_ds, and sample are defined
# by the feature pipeline above
model = Sequential([
    LSTM(64, input_shape=(window_size, feature_dim)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['AUC'])
model.fit(train_ds, epochs=20, validation_data=val_ds)

# Explain with SHAP; DeepExplainer expects an array-like background,
# so we draw one (features, labels) batch from the training dataset
background = next(iter(train_ds.take(1)))[0].numpy()
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(sample)
shap.force_plot(explainer.expected_value[0], shap_values[0], sample)
```

The LSTM achieves an AUC of 0.93 and a precision-recall curve that meets the hospital's operating threshold (precision ≥ 0.75 at recall ≥ 0.6).

### 3.5 Deployment & Governance

- The model is packaged as a Docker container and deployed to the hospital's Kubernetes cluster.
- Alerts are sent to the bedside monitor and a pager system via HL7 v2 messages.
- All patient identifiers are hashed; PHI is removed before model input.
- **Explainability**: LIME snippets are displayed in the EHR UI for the attending physician.
- **Audit**: every alert, model version, and input data snapshot is logged in a secure, tamper-evident database.

---

## 4. Marketing – Optimizing Customer Acquisition Costs (CAC) for a SaaS Product

| **Phase** | **Key Activities** | **Tools/Tech** |
|-----------|--------------------|----------------|
| Problem Definition | Reduce CAC by identifying high-value leads | Marketing KPI review |
| Data | CRM logs, web analytics, advertising spend | Google Analytics, Segment, Snowflake |
| Pipeline | Real-time lead scoring, attribution modeling | Airbyte, dbt, Snowplow |
| Modeling | Multi-arm bandit for campaign allocation, linear models for cost estimation | Optuna, scikit-opt, statsmodels |
| Deployment | Campaign manager dashboard, automated bid adjustments | Marketing Automation, Zapier |
| Governance | Consent management, data minimization | OneTrust, Privacy-by-Design |
| Outcome | 22% drop in CAC, 18% increase in qualified leads | ROI metrics |

### 4.1 Problem Definition

Marketing aims to allocate limited ad spend across channels to maximize lead quality. The core analytical question:

> *Which ad creative and channel combinations produce leads with the highest conversion probability while keeping CAC below the target threshold?*

### 4.2 Data Acquisition & Pipeline

We ingest clickstream and conversion events from Snowplow via Airbyte into Snowflake. The feature store holds **source-specific** and **aggregate** metrics:

```sql
SELECT
    visitor_id,
    campaign_id,
    channel,
    SUM(clicks)           AS total_clicks,
    SUM(conversions)      AS total_conversions,
    AVG(session_duration) AS avg_session
FROM snowplow_events
GROUP BY visitor_id, campaign_id, channel;
```

### 4.3 Exploratory Analysis & Attribution

We build a **Shapley attribution** model to allocate credit to each channel. The analysis reveals that social media drives 30% of conversions but at higher cost, while organic search yields a lower cost per conversion.
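Shapley attribution can be computed exactly when the number of channels is small: each channel's credit is its marginal contribution averaged over all coalitions of the other channels. The coalition conversion counts below are hypothetical, purely for illustration:

```python
from itertools import combinations
from math import factorial

# Hypothetical conversions observed for each coalition of touched channels
coalition_value = {
    frozenset(): 0,
    frozenset({'social'}): 30,
    frozenset({'search'}): 50,
    frozenset({'display'}): 10,
    frozenset({'social', 'search'}): 90,
    frozenset({'social', 'display'}): 45,
    frozenset({'search', 'display'}): 65,
    frozenset({'social', 'search', 'display'}): 100,
}

channels = ['social', 'search', 'display']
n = len(channels)

def shapley(channel: str) -> float:
    """Weighted average of the channel's marginal contribution over coalitions."""
    total = 0.0
    others = [c for c in channels if c != channel]
    for k in range(len(others) + 1):
        for coalition in combinations(others, k):
            s = frozenset(coalition)
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (coalition_value[s | {channel}] - coalition_value[s])
    return total

credit = {c: shapley(c) for c in channels}
```

By construction the credits sum to the grand-coalition value, so the allocation is "efficient" in the game-theoretic sense; with many channels, sampling approximations replace the exact enumeration.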
### 4.4 Optimization & Modeling

We frame campaign allocation as a **multi-armed bandit** problem and search for a budget split with Optuna:

```python
import optuna

def objective(trial):
    # Allocate a budget fraction to each channel
    budgets = {
        'social': trial.suggest_float('social', 0.1, 0.5),
        'search': trial.suggest_float('search', 0.1, 0.5),
        'display': trial.suggest_float('display', 0.0, 0.3),
    }
    # Simulate returns based on historical CTR/CR
    total_cac = simulate_cac(budgets)
    return total_cac

# Minimize CAC over the budget space
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
```

The optimal allocation reduces CAC by 22% while maintaining conversion rates.

### 4.5 Deployment & Governance

- A **Zapier** integration pushes budget allocations to the ad platform API.
- A dashboard built in Metabase displays real-time CAC, CPL, and lead quality.
- **Consent** is tracked via Segment's user-profile flags, ensuring only opted-in users are targeted.
- **Data minimization**: personally identifiable fields are hashed; only hashed identifiers are stored.

---

## 5. Synthesis & Lessons Learned

| **Industry** | **Key Success Factors** | **Common Ethical Challenges** |
|--------------|------------------------|------------------------------|
| Retail | Real-time engagement scoring, clear opt-in for churn alerts | Risk of profiling, data security |
| Finance | Fairness constraints, audit trails for regulatory reporting | Bias mitigation, transparency |
| Healthcare | Temporal modeling, stringent privacy controls | HIPAA compliance, explainability |
| Marketing | Attribution accuracy, consent management | Data minimization, user trust |

### 5.1 Cross-Industry Themes

1. **End-to-End Automation** – From ingestion to model scoring, pipelines should be reproducible, versioned, and observable.
2. **Bias & Fairness** – Early bias audits prevent downstream discrimination.
3. **Explainability** – SHAP, LIME, or domain-specific tools bridge the gap between model predictions and stakeholder trust.
4. **Governance & Privacy** – Immutable logs, data minimization, and compliance frameworks are non-negotiable across sectors.
5. **Operational Integration** – Models must fit seamlessly into existing workflows; otherwise, the analytical value is lost.

### 5.2 Final Thought

Analytics in action is not a set of isolated models; it is a holistic ecosystem that intertwines data, people, processes, and policy. By embedding responsibility at every stage—ethical design, transparent evaluation, and robust deployment—we ensure that the transformative power of data serves both business objectives and societal good.

---

> *"The best analytics practice is to ask not only what the data says, but how it can be trusted."* – 墨羽行