Analytics Alchemy: Turning Data into Strategic Advantage – Chapter 9
Published 2026-03-02 17:04
# Chapter 9: Analytics in Action: Case Studies
In this chapter we walk through four detailed, industry‑specific projects that demonstrate the full analytics lifecycle—from problem definition through deployment—while weaving in the ethical, governance, and operational considerations discussed in Chapter 8. Each case study is structured around the same five‑step framework:
1. **Business Objective & Question Formulation**
2. **Data Acquisition & Pipeline Construction**
3. **Exploratory Analysis & Feature Engineering**
4. **Model Development, Evaluation & Explainability**
5. **Deployment, Monitoring & Ethical Governance**
We will use real‑world data sources where possible, and include reproducible Python code snippets and pseudocode to illustrate key techniques.
---
## 1. Retail – Predicting Customer Churn for a Subscription‑Based Streaming Service
| **Phase** | **Key Activities** | **Tools/Tech** |
|-----------|--------------------|----------------|
| Problem Definition | Identify which subscribers are at risk of cancelling within 30 days | Stakeholder interviews, churn definition document |
| Data | Historical subscription logs, usage metrics, support tickets | PostgreSQL, Apache Airflow |
| Pipeline | Ingest, cleanse, compute engagement scores, aggregate into a monthly snapshot | dbt, Pandas, Spark |
| Modeling | Logistic regression, XGBoost, SHAP for feature importance | scikit‑learn, xgboost, shap |
| Deployment | API via FastAPI, Slack alerts for high‑risk users | Docker, Kubernetes, Prometheus |
| Governance | Bias audit for demographic features, privacy‑preserving hashing | Differential Privacy (PyDP), GDPR checklist |
| Outcome | 12% reduction in churn within 3 months | KPI dashboard |
### 1.1 Problem Definition
The streaming platform faces a high churn rate among newly acquired subscribers. The business objective is to reduce churn by 10% over the next fiscal year. We frame the analytical question:
> *Which subscribers are most likely to cancel in the next 30 days, and what actions can be taken to retain them?*
### 1.2 Data Acquisition & Pipeline
```python
from pyspark.sql.functions import col, from_json

# Ingest raw events from the Kafka topic into a Delta table.
# `spark` is the active SparkSession and `schema` is the StructType
# of the event payload, both defined elsewhere in the pipeline.
(spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "subscription_events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json_str")
    .select(from_json(col("json_str"), schema).alias("data"))
    .select("data.*")
    .write
    .format("delta")
    .mode("append")
    .save("/mnt/delta/raw/subscription_events"))
```
Using **dbt** we transform the raw table into a curated view:
```sql
-- models/subscriber_profile.sql
WITH base AS (
    SELECT
        subscriber_id,
        MIN(start_date)       AS first_subscription_date,
        MAX(end_date)         AS last_subscription_date,
        COUNT(*)              AS total_sessions,
        AVG(session_duration) AS avg_session
    FROM {{ ref('raw_subscription_events') }}
    GROUP BY subscriber_id
)
SELECT * FROM base
```
### 1.3 Exploratory Analysis & Feature Engineering
```python
import pandas as pd
import seaborn as sns

# Query the dbt-built table by its materialized name; Jinja `ref()` only
# resolves inside dbt, not in application code. Schema name is illustrative.
df = pd.read_sql("SELECT * FROM analytics.subscriber_profile", con=engine)

# Label churn: no active subscription in the last 30 days
df['churned'] = df['last_subscription_date'] < (pd.Timestamp('today') - pd.Timedelta(days=30))

# Visualize a key driver
sns.boxplot(x='churned', y='total_sessions', data=df)
```
Feature engineering steps include:
- **Engagement score**: weighted sum of daily active minutes.
- **Recency, Frequency, Monetary (RFM)** values.
- **Support ticket count** in the last 60 days.
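These features can be sketched in pandas as follows. The event frame, its column names, and the 0.97 daily decay weight on the engagement score are illustrative assumptions, not values from the production pipeline:

```python
import pandas as pd

# Hypothetical event-level data: one row per (subscriber, day)
events = pd.DataFrame({
    "subscriber_id": [1, 1, 2, 2, 2],
    "event_date": pd.to_datetime(
        ["2026-01-01", "2026-01-20", "2026-01-05", "2026-01-25", "2026-02-01"]),
    "active_minutes": [30, 45, 10, 5, 20],
    "amount_paid": [9.99, 9.99, 4.99, 4.99, 4.99],
})

snapshot = pd.Timestamp("2026-02-15")

# Engagement score: daily active minutes, exponentially down-weighted by age
days_ago = (snapshot - events["event_date"]).dt.days
events["weighted_minutes"] = events["active_minutes"] * (0.97 ** days_ago)

# RFM plus engagement, aggregated per subscriber
rfm = events.groupby("subscriber_id").agg(
    recency_days=("event_date", lambda d: (snapshot - d.max()).days),
    frequency=("event_date", "count"),
    monetary=("amount_paid", "sum"),
    engagement=("weighted_minutes", "sum"),
).reset_index()
```

The same aggregation extends naturally to the support-ticket count by joining a ticket table filtered to the last 60 days.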
### 1.4 Model Development & Explainability
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
import shap

X = df.drop(columns=['subscriber_id', 'churned'])
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling is optional for tree ensembles, but keeps the pipeline uniform
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=200)
model.fit(X_train, y_train)

# SHAP explanations
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=X.columns)
```
Performance metrics: ROC‑AUC = 0.87, precision@k (k=200) = 0.68.
### 1.5 Deployment & Governance
- The model is exported as an **ONNX** bundle and served via FastAPI.
- A **Slack bot** triggers when a subscriber's predicted churn probability > 0.7.
- **Bias audit**: model predictions are stratified by gender and region; no statistically significant disparity after re‑balancing.
- **Privacy**: subscriber IDs are hashed with SHA‑256 before being stored in the API logs.
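One caveat on the hashing step: a plain SHA‑256 over a small ID space can be reversed by brute force, so a keyed hash is the usual mitigation. A minimal sketch (the pepper value and helper name are illustrative):

```python
import hashlib
import hmac

# Illustrative secret key ("pepper"); in production it would live in a
# secrets manager and be rotated, never hard-coded
PEPPER = b"rotate-me-regularly"

def pseudonymize(subscriber_id: str) -> str:
    """Keyed SHA-256 (HMAC) so raw subscriber IDs never reach the API logs."""
    return hmac.new(PEPPER, subscriber_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

Because the mapping is deterministic, the same subscriber yields the same token across log lines, preserving joinability without exposing the raw ID.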
---
## 2. Finance – Credit Risk Assessment for a Peer‑to‑Peer Lending Platform
| **Phase** | **Key Activities** | **Tools/Tech** |
|-----------|--------------------|----------------|
| Problem Definition | Predict probability of default within 12 months | Credit scoring policy |
| Data | Loan origination records, payment history, external credit bureaus | Snowflake, Fivetran |
| Pipeline | Data lake ingestion, feature store, versioned models | DataRobot, Hopsworks |
| Modeling | Gradient Boosting, GLM, fairness constraints | scikit‑learn, Fairlearn |
| Deployment | Batch scoring via Airflow, real‑time API | FastAPI, Docker |
| Governance | Explainable AI, audit trails, regulatory reporting | SHAP, Evidently AI |
| Outcome | 15% reduction in default rate, 2% higher NPV | Risk‑adjusted return |
### 2.1 Problem Definition
The platform seeks to refine its credit decisioning to balance risk and growth. The analytical question:
> *Given borrower attributes, can we estimate a probability of default (PD) that aligns with the institution’s risk appetite and regulatory constraints?*
### 2.2 Data Acquisition & Pipeline
Data from multiple sources is integrated into Snowflake via Fivetran. A **feature store** (Hopsworks) keeps versioned features:
```sql
-- Feature: debt_to_income_ratio
-- annual_income repeats on every payment row, so take MAX rather than SUM
SELECT borrower_id,
       SUM(debt_amount) / MAX(annual_income) AS dti
FROM loan_payments
GROUP BY borrower_id;
```
The feature store schema includes:
- **Static**: borrower demographics, credit bureau scores.
- **Dynamic**: recent payment behavior, account balances.
### 2.3 Exploratory Analysis & Feature Engineering
We build a **missingness matrix** to choose imputation strategies, construct **interaction terms** between employment stability and DTI, and compute **credit utilization ratios** per credit line.
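These steps can be sketched in pandas; the column names, sample values, and the median-imputation choice are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical borrower-level frame with some missing values
loans = pd.DataFrame({
    "dti": [0.35, np.nan, 0.52, 0.18],
    "employment_years": [4.0, 1.5, np.nan, 10.0],
    "credit_limit": [10000, 5000, 8000, 20000],
    "balance": [2500, 4000, 1000, 3000],
})

# Missingness matrix: share of missing values per feature
missing_share = loans.isna().mean()

# Median imputation plus missing-indicator flags
for col in ["dti", "employment_years"]:
    loans[f"{col}_missing"] = loans[col].isna().astype(int)
    loans[col] = loans[col].fillna(loans[col].median())

# Interaction term between employment stability and DTI
loans["emp_x_dti"] = loans["employment_years"] * loans["dti"]

# Credit utilization per credit line
loans["utilization"] = loans["balance"] / loans["credit_limit"]
```

The missing-indicator flags let the downstream model learn whether missingness itself is predictive of default, rather than silently trusting the imputed value.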
### 2.4 Model Development & Fairness
```python
from sklearn.ensemble import GradientBoostingClassifier
from fairlearn.metrics import demographic_parity_difference

# Exclude the sensitive attribute from the feature matrix; it is used
# only for the fairness audit below
X = df.drop(columns=['default', 'age_group'])
y = df['default']

model = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05)
model.fit(X, y)

# Fairness metric: demographic parity across age groups
# (in practice, evaluate on a held-out split rather than the training data)
dp_diff = demographic_parity_difference(
    y_true=y, y_pred=model.predict(X), sensitive_features=df['age_group'])
print(f"Demographic parity difference: {dp_diff:.3f}")
```
If disparity > 0.05, we re‑weight the training set or add a fairness constraint via `fairlearn`. The final model achieves AUC = 0.92 and parity difference = 0.02.
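One concrete re-weighting scheme is Kamiran and Calders' preprocessing approach: weight each (group, label) cell by its expected-over-observed frequency so that labels become statistically independent of the sensitive attribute. A self-contained sketch (fairlearn's reductions API is the constraint-based alternative mentioned above):

```python
import numpy as np
import pandas as pd

def reweight(groups: pd.Series, labels: pd.Series) -> np.ndarray:
    """Kamiran-Calders re-weighting: weight = P(group) * P(label) / P(group, label),
    so the weighted label rate is equal across groups."""
    w = np.ones(len(labels), dtype=float)
    for g in groups.unique():
        for y in labels.unique():
            mask = ((groups == g) & (labels == y)).to_numpy()
            observed = mask.mean()
            if observed > 0:
                expected = (groups == g).mean() * (labels == y).mean()
                w[mask] = expected / observed
    return w
```

The resulting vector is passed straight to the estimator, e.g. `model.fit(X, y, sample_weight=reweight(df['age_group'], y))`.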
### 2.5 Deployment & Governance
- Scores are batched nightly using Airflow and updated in the risk score table.
- A REST endpoint (FastAPI) provides real‑time PD for new loan requests.
- **Explainability**: SHAP force plots accompany each score in the risk dashboard.
- **Audit trail**: Every score, model version, and feature extraction timestamp is logged in a tamper‑evident ledger.
- **Regulatory reporting**: Evidently AI generates KPI reports for Basel III compliance.
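The tamper-evident ledger can be approximated with a simple hash chain, where each entry's digest covers the previous digest, so any retroactive edit breaks verification. This is a sketch of the idea, not a substitute for a hardened audit store:

```python
import hashlib
import json

class HashChainLedger:
    """Append-only ledger: each entry's hash covers the previous hash."""

    def __init__(self):
        self.entries = []
        self._prev = "0" * 64  # genesis value

    def append(self, record: dict) -> str:
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((self._prev + payload).encode()).hexdigest()
        self.entries.append({"record": record, "hash": digest})
        self._prev = digest
        return digest

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            payload = json.dumps(e["record"], sort_keys=True)
            if hashlib.sha256((prev + payload).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

Each scoring event would append the score, model version, and feature timestamp as one record; a periodic `verify()` pass detects tampering anywhere in the chain.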
---
## 3. Healthcare – Early Detection of Sepsis in ICU Patients
| **Phase** | **Key Activities** | **Tools/Tech** |
|-----------|--------------------|----------------|
| Problem Definition | Predict onset of sepsis 6 hours before clinical diagnosis | Clinical workflow analysis |
| Data | Electronic Health Records (EHR), vital signs, lab results | RedCap, Mirth Connect |
| Pipeline | Real‑time ingestion, feature extraction, model scoring | FHIR API, TensorFlow Extended (TFX) |
| Modeling | Recurrent neural networks (LSTM), temporal feature engineering | Keras, sktime |
| Deployment | EHR dashboard alerts, nurse notification system | FHIR, HL7 v2, PagerDuty |
| Governance | Patient privacy (HIPAA), explainability | LIME, SHAP, de‑identification pipeline |
| Outcome | 30% reduction in time to treatment, improved survival | Clinical outcome metrics |
### 3.1 Problem Definition
Sepsis is a life‑threatening condition that requires rapid intervention. The analytical goal:
> *Can we build a predictive model that flags patients likely to develop sepsis 6 hours before clinical diagnosis, allowing clinicians to intervene earlier?*
### 3.2 Data Acquisition & Pipeline
We connect to the hospital’s FHIR server to stream real‑time vitals and labs:
```python
import time

from fhirclient import client

smart = client.FHIRClient(settings={
    'app_id': 'myapp',
    'api_base': 'https://ehr.example.com/fhir',
})

# Poll vital-sign Observations once a minute (simplified; production systems
# would use FHIR Subscriptions or a streaming interface rather than polling).
# `patient_id` and `write_to_delta` are defined elsewhere.
while True:
    bundle = smart.server.request_json(
        f'Observation?patient={patient_id}&category=vital-signs')
    write_to_delta(bundle)  # persist the raw bundle to Delta Lake
    time.sleep(60)
```
The **TFX** pipeline includes data validation, feature transformation (time‑window aggregates), and model training.
### 3.3 Exploratory Analysis & Temporal Features
We compute rolling statistics over 6‑hour windows:
- Mean, standard deviation, trend slope for heart rate, temperature, MAP.
- Interaction terms between lactate and white blood cell count.
Visualization of sepsis onset timelines reveals a sharp spike in lactate 4 hours before diagnosis.
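The rolling statistics can be sketched in pandas with a time-offset window. The example uses hourly heart-rate readings to stay compact (the real pipeline is minute-level), and the column names are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly vitals for one patient
vitals = pd.DataFrame({
    "ts": pd.date_range("2026-01-01", periods=12, freq="h"),
    "heart_rate": [80, 82, 85, 90, 95, 99, 104, 110, 118, 125, 131, 140],
}).set_index("ts")

win = vitals["heart_rate"].rolling("6h")

features = pd.DataFrame({
    "hr_mean": win.mean(),
    "hr_std": win.std(),
    # Trend slope: least-squares fit over each 6-hour window
    "hr_slope": win.apply(
        lambda s: np.polyfit(np.arange(len(s)), s, 1)[0] if len(s) > 1 else 0.0,
        raw=False),
})
```

The same pattern applies per vital sign; interaction terms (e.g. lactate × white blood cell count) are then computed column-wise on the resulting feature frame.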
### 3.4 Model Development & Explainability
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
import shap

model = Sequential([
    LSTM(64, input_shape=(window_size, feature_dim)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['AUC'])
model.fit(train_ds, epochs=20, validation_data=val_ds)

# Explain with SHAP (DeepExplainer expects arrays, not tf.data datasets)
background = next(iter(train_ds))[0].numpy()
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(sample)
shap.force_plot(explainer.expected_value[0], shap_values[0], sample)
```
The LSTM achieves an AUC = 0.93 and precision‑recall curve that meets the hospital’s operating threshold (precision ≥ 0.75 at recall ≥ 0.6).
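The operating threshold itself can be chosen by a simple sweep over predicted probabilities, taking the highest threshold that clears both floors. A sketch (function name and inputs are illustrative; sklearn's `precision_recall_curve` performs the same scan):

```python
import numpy as np

def pick_threshold(y_true, y_prob, min_precision=0.75, min_recall=0.6):
    """Scan thresholds from high to low; return the first (highest) one
    whose precision and recall both meet the operating floor."""
    n_pos = (y_true == 1).sum()
    for t in np.unique(y_prob)[::-1]:
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))
        if tp == 0:
            continue
        precision = tp / pred.sum()
        recall = tp / n_pos
        if precision >= min_precision and recall >= min_recall:
            return t, precision, recall
    return None  # no threshold satisfies both constraints
```

Preferring the highest qualifying threshold keeps the alert volume, and hence nurse fatigue, as low as the constraints allow.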
### 3.5 Deployment & Governance
- The model is packaged as a Docker container and deployed to the hospital’s Kubernetes cluster.
- Alerts are sent to the bedside monitor and a pager system via HL7 v2 messages.
- All patient identifiers are hashed; PHI is removed before model input.
- **Explainability**: LIME snippets are displayed in the EHR UI for the attending physician.
- **Audit**: Every alert, model version, and input data snapshot is logged in a secure, tamper‑evident database.
---
## 4. Marketing – Optimizing Customer Acquisition Costs (CAC) for a SaaS Product
| **Phase** | **Key Activities** | **Tools/Tech** |
|-----------|--------------------|----------------|
| Problem Definition | Reduce CAC by identifying high‑value leads | Marketing KPI review |
| Data | CRM logs, web analytics, advertising spend | Google Analytics, Segment, Snowflake |
| Pipeline | Real‑time lead scoring, attribution modeling | Airbyte, dbt, Snowplow |
| Modeling | Multi‑arm bandit for campaign allocation, linear models for cost estimation | Optuna, scikit‑opt, statsmodels |
| Deployment | Campaign manager dashboard, automated bid adjustments | Marketing Automation, Zapier |
| Governance | Consent management, data minimization | OneTrust, Privacy‑by‑Design |
| Outcome | 22% drop in CAC, 18% increase in qualified leads | ROI metrics |
### 4.1 Problem Definition
Marketing aims to allocate limited ad spend across channels to maximize lead quality. The core analytical question:
> *Which ad creative and channel combinations produce leads with the highest conversion probability while keeping CAC below the target threshold?*
### 4.2 Data Acquisition & Pipeline
We ingest clickstream and conversion events from Snowplow via Airbyte into Snowflake. The feature store stores **source‑specific** and **aggregate** metrics:
```sql
SELECT
    visitor_id,
    campaign_id,
    channel,
    SUM(clicks)           AS total_clicks,
    SUM(conversions)      AS total_conversions,
    AVG(session_duration) AS avg_session
FROM snowplow_events
GROUP BY visitor_id, campaign_id, channel;
```
### 4.3 Exploratory Analysis & Attribution
We build a **Shapley attribution** model to allocate conversion credit to each channel. The analysis reveals that social media drives 30% of conversions but at a higher cost, while organic search yields a lower cost per conversion.
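With only a handful of channels, exact Shapley values are tractable by averaging marginal contributions over all channel orderings. A self-contained sketch; the coalition values below are hypothetical conversion rates, not figures from the case:

```python
from itertools import permutations

def shapley_attribution(channels, v):
    """Exact Shapley values for a small channel set, where v(coalition)
    is the conversion value attributed to that set of touchpoints."""
    phi = {c: 0.0 for c in channels}
    perms = list(permutations(channels))
    for order in perms:
        seen = frozenset()
        for c in order:
            phi[c] += v(seen | {c}) - v(seen)  # marginal contribution
            seen = seen | {c}
    return {c: val / len(perms) for c, val in phi.items()}

# Hypothetical coalition values (conversion rates by touchpoint set)
vals = {frozenset(): 0.0,
        frozenset({"social"}): 0.2,
        frozenset({"search"}): 0.3,
        frozenset({"social", "search"}): 0.6}
phi = shapley_attribution(["social", "search"], lambda s: vals[frozenset(s)])
```

Enumerating permutations is exponential in the number of channels, so for more than ~10 channels the attribution would switch to Monte Carlo sampling of orderings.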
### 4.4 Optimization & Modeling
We frame the campaign allocation as a **multi‑armed bandit** problem:
```python
import optuna

def objective(trial):
    # Allocate budget shares across channels
    budgets = {
        'social':  trial.suggest_float('social', 0.1, 0.5),
        'search':  trial.suggest_float('search', 0.1, 0.5),
        'display': trial.suggest_float('display', 0.0, 0.3),
    }
    # Simulate CAC from historical CTR/CR; Optuna minimizes by default,
    # so we return CAC directly rather than negating it
    return simulate_cac(budgets)

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
```
The optimal allocation reduces CAC by 22% while maintaining conversion rates.
### 4.5 Deployment & Governance
- A **Zapier** integration pushes budget allocations to the ad platform API.
- A dashboard built in Metabase displays real‑time CAC, CPL, and lead quality.
- **Consent** is tracked via Segment’s user‑profile flags, ensuring only opted‑in users are targeted.
- Data minimization: personally identifiable fields are hashed; only hashed identifiers are stored.
---
## 5. Synthesis & Lessons Learned
| **Industry** | **Key Success Factors** | **Common Ethical Challenges** |
|--------------|------------------------|------------------------------|
| Retail | Real‑time engagement scoring, clear opt‑in for churn alerts | Risk of profiling, data security |
| Finance | Fairness constraints, audit trails for regulatory reporting | Bias mitigation, transparency |
| Healthcare | Temporal modeling, stringent privacy controls | HIPAA compliance, explainability |
| Marketing | Attribution accuracy, consent management | Data minimization, user trust |
### 5.1 Cross‑Industry Themes
1. **End‑to‑End Automation** – From ingestion to model scoring, pipelines should be reproducible, versioned, and observable.
2. **Bias & Fairness** – Early bias audits prevent downstream discrimination.
3. **Explainability** – SHAP, LIME, or domain‑specific tools bridge the gap between model predictions and stakeholder trust.
4. **Governance & Privacy** – Immutable logs, data minimization, and compliance frameworks are non‑negotiable across sectors.
5. **Operational Integration** – Models must fit seamlessly into existing workflows; otherwise, the analytical value is lost.
### 5.2 Final Thought
Analytics in action is not a set of isolated models; it is a holistic ecosystem that intertwines data, people, processes, and policy. By embedding responsibility at every stage—ethical design, transparent evaluation, and robust deployment—we ensure that the transformative power of data serves both business objectives and societal good.
---
> *“The best analytics practice is to ask not only what the data says, but how it can be trusted.”* –墨羽行