Data Science for the Modern Analyst: From Concepts to Implementation - Chapter 4
Published 2026-02-26 06:13
# Chapter 4
## Building Scalable Machine‑Learning Pipelines
When Lin stared at the flood of click‑stream data spilling from the company’s internal log system, she realised that raw numbers alone were useless without a well‑architected pipeline to transform them into insights. This chapter walks through the principles, patterns, and tools that let us turn messy data into reliable, repeatable models that can scale from a handful of customers to millions.
---
### 4.1 The Blueprint
A pipeline is a **chain of tasks** that starts with data ingestion and ends with a live model serving endpoint. It is built to:
1. **Handle volume and velocity** – ingest data in batch or stream.
2. **Guarantee quality** – enforce validation, deduplication, and anomaly detection.
3. **Maintain reproducibility** – record every transformation and model state.
4. **Support evolution** – let new features, models, or business rules be added with minimal friction.
5. **Operate in production** – monitor latency, accuracy, and resource usage.
Lin’s goal is to keep this chain **modular**. A broken link in one module should not bring down the whole system. Think of it as a set of Lego blocks that can be swapped, upgraded, or patched independently.
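The "chain of swappable blocks" idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the book's production system: each stage is an independent function with a common signature, and the pipeline simply composes them, so any one stage can be retried, replaced, or upgraded in isolation.

```python
# Illustrative sketch: each pipeline stage is an independent, swappable
# function; the pipeline just composes them in order.
from typing import Callable, Iterable

Record = dict
Stage = Callable[[Iterable[Record]], Iterable[Record]]

def ingest(_: Iterable[Record]) -> Iterable[Record]:
    # Stand-in for reading from Kafka / S3
    return [{"user": "u1", "clicks": 3}, {"user": "u2", "clicks": None}]

def validate(records: Iterable[Record]) -> Iterable[Record]:
    # Drop records that fail a basic quality check
    return [r for r in records if r["clicks"] is not None]

def enrich(records: Iterable[Record]) -> Iterable[Record]:
    return [{**r, "heavy_user": r["clicks"] > 2} for r in records]

def run_pipeline(stages: list[Stage]) -> Iterable[Record]:
    data: Iterable[Record] = []
    for stage in stages:
        data = stage(data)  # a failing stage can be retried or swapped alone
    return data

print(run_pipeline([ingest, validate, enrich]))
# → [{'user': 'u1', 'clicks': 3, 'heavy_user': True}]
```

Because every stage shares one interface, swapping `validate` for a stricter implementation touches exactly one "Lego block".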
---
### 4.2 Modular Design with Micro‑Services
Rather than a monolith, Lin adopts a **micro‑service** architecture where each stage of the pipeline is a containerized process:
| Stage | Responsibility | Typical Tech Stack |
|-------|----------------|--------------------|
| Ingestion | Pull from Kafka / S3 | Apache Flink / Spark Structured Streaming |
| ETL | Clean, enrich, store | Airflow DAGs, dbt models |
| Feature Store | Serve static and streaming features | Feast, Redshift / BigQuery |
| Model Training | Train & tune models | scikit‑learn, XGBoost, TensorFlow |
| Model Registry | Version, test, promote | MLflow, Weights & Biases |
| Serving | Predict in real‑time | TensorFlow Serving, TorchServe |
| Monitoring | Metrics, alerts | Prometheus, Grafana |
This separation allows Lin to iterate on feature engineering without waiting for the training job to finish, or to roll back a model without disrupting the ingestion layer.
---
### 4.3 Data Ingestion: Batch & Stream
Lin faces two types of data: **historical sales** stored in nightly CSV dumps and **real‑time click events** arriving on Kafka. Her ingestion strategy is:
```python
# batch_ingest.py (Airflow DAG)
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime
with DAG('batch_ingest', start_date=datetime(2024, 1, 1), schedule_interval='@daily') as dag:
    download = BashOperator(
        task_id='download_csv',
        # `aws s3 cp` does not expand shell globs; use --recursive with filters
        bash_command=(
            'aws s3 cp s3://sales/2024/ ./data/ '
            '--recursive --exclude "*" --include "*.csv"'
        )
    )
    transform = BashOperator(
        task_id='transform',
        bash_command='python scripts/transform_batch.py'
    )
    download >> transform
```
```python
# stream_ingest.py (Flink job)
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaSource, KafkaSink, KafkaRecordSerializationSchema)

env = StreamExecutionEnvironment.get_execution_environment()
# String (de)serialisation is illustrative; a real job would use JSON/Avro schemas
source = (KafkaSource.builder()
          .set_bootstrap_servers("kafka:9092").set_topics("clicks")
          .set_value_only_deserializer(SimpleStringSchema()).build())
record_serializer = (KafkaRecordSerializationSchema.builder()
                     .set_topic("cleaned_clicks")
                     .set_value_serialization_schema(SimpleStringSchema()).build())
sink = (KafkaSink.builder().set_bootstrap_servers("kafka:9092")
        .set_record_serializer(record_serializer).build())

env.from_source(source, WatermarkStrategy.no_watermarks(), "clicks").sink_to(sink)
env.execute("clean_clicks")
```
By decoupling ingestion from downstream transforms, Lin can scale each component independently.
---
### 4.4 Feature Engineering & Feature Store
Lin’s models need both **static** attributes (customer age, gender) and **dynamic** aggregates (last 7‑day purchase frequency). She stores features in a **Feast** feature store so that training and serving consume the same data.
```python
# feast_feature.py
from datetime import timedelta
from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Int64, String

customer = Entity(name="customer", join_keys=["customer_id"],
                  description="Customer ID")

# Offline source backing the features (the path is project-specific)
source = FileSource(path="data/customer_features.parquet",
                    timestamp_field="event_timestamp")

customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    ttl=timedelta(days=7),
    schema=[
        Field(name="age", dtype=Int64),                  # customer age
        Field(name="gender", dtype=String),              # customer gender
        Field(name="recent_purchase_cnt", dtype=Int64),  # purchases in last 7 days
    ],
    source=source,
)

FeatureStore(repo_path=".").apply([customer, customer_features])
```
Now the training script simply pulls features by key, eliminating manual joins.
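Setting Feast's API aside, the core guarantee can be sketched in plain Python: one feature table keyed by entity ID, read through the same call by both the training and serving paths, so neither side performs its own ad-hoc joins. The table contents below are invented for illustration.

```python
# Toy illustration (not the Feast API): a single feature table keyed by
# customer ID, read identically at training and serving time.
FEATURE_TABLE = {
    "c001": {"age": 34, "gender": "F", "recent_purchase_cnt": 5},
    "c002": {"age": 41, "gender": "M", "recent_purchase_cnt": 1},
}

def get_features(customer_ids, feature_names):
    """Return feature rows by key — the same call for train and serve."""
    return [[FEATURE_TABLE[cid][f] for f in feature_names]
            for cid in customer_ids]

# Training pulls a historical batch...
train_X = get_features(["c001", "c002"], ["age", "recent_purchase_cnt"])
# ...and serving pulls the very same features for one live request.
live_X = get_features(["c001"], ["age", "recent_purchase_cnt"])
print(train_X, live_X)  # → [[34, 5], [41, 1]] [[34, 5]]
```

Because both paths go through `get_features`, a feature definition can change in exactly one place without training and serving drifting apart.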
---
### 4.5 Model Training: From Prototyping to Production
Lin starts with a quick Jupyter notebook, experimenting with a gradient‑boosted tree. She then moves the best model into a reproducible training pipeline using **MLflow**:
```python
# train.py
import mlflow
import mlflow.sklearn
from xgboost import XGBClassifier   # XGBClassifier ships with xgboost, not scikit-learn
from sklearn.metrics import roc_auc_score
mlflow.set_experiment("promo_response")
with mlflow.start_run():
    X_train, X_test, y_train, y_test = load_data()   # project-specific loader
    model = XGBClassifier(n_estimators=300, learning_rate=0.05)
    model.fit(X_train, y_train)
    pred = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, pred)
    mlflow.log_metric("auc", auc)
    mlflow.sklearn.log_model(model, "model")
```
The run is recorded in the MLflow tracking server, capturing hyperparameters, code versions, and artifacts. Lin can later promote the model to the central registry and use the **mlflow models** CLI to serve it locally or package it for any serving platform.
---
### 4.6 Model Validation & Governance
Validation is more than a test split. Lin uses a **validation pipeline** that:
1. **Runs A/B tests** on a fraction of traffic.
2. **Monitors data and concept drift** with the *Alibi Detect* library.
3. **Validates data quality** with *Great Expectations*.
4. **Logs every request** for audit via *OpenTelemetry*.
```python
# drift_detection.py
from alibi_detect.cd import KSDrift   # per-feature Kolmogorov–Smirnov drift test
import numpy as np

# Reference window drawn from the training distribution
drift_detector = KSDrift(np.random.randn(1000, 10), p_val=0.05)

# In real time
new_batch = get_live_features()
result = drift_detector.predict(new_batch)
if result["data"]["is_drift"]:
    alert("Drift detected")
```
These checks ensure that Lin’s models remain trustworthy as the data ecosystem evolves.
---
### 4.7 Model Serving & Continuous Delivery
Deploying a model in a production environment demands low latency and high availability. Lin chooses **TensorFlow Serving** wrapped in a **Kubernetes** pod, orchestrated by **ArgoCD** for Git‑driven continuous delivery.
```yaml
# tf-serving-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: promo-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: promo-model
  template:
    metadata:
      labels:
        app: promo-model
    spec:
      containers:
        - name: serving
          image: tensorflow/serving:2.8.0
          args: ["--model_name=promo_response", "--model_base_path=/models/promo_response"]
          ports:
            - containerPort: 8500   # gRPC; 8501 serves the REST API
```
ArgoCD watches the Git repository for new *model.yaml* files and automatically updates the deployment. This approach guarantees that the latest validated model is always in production without manual intervention.
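A Git-driven deployment like this is typically described by an ArgoCD `Application` resource. A minimal sketch might look like the following, where the repository URL, paths, and namespaces are placeholders, not values from Lin's actual setup:

```yaml
# argocd-application.yaml — illustrative; repoURL, path, and namespaces are placeholders
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: promo-model
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/ml/deployments.git
    targetRevision: main
    path: promo-model            # directory holding tf-serving-deployment.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true             # re-sync if the live state drifts from Git
```

With `automated` sync enabled, merging a manifest change to `main` is the entire deployment procedure.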
---
### 4.8 Monitoring, Logging, and Alerting
Lin integrates **Prometheus** for metrics, **Grafana** for dashboards, and **PagerDuty** for alerts. Key metrics include:
- Prediction latency per request.
- Queue depth for incoming events.
- Model accuracy over time.
- CPU/GPU utilization.
An anomaly in prediction latency triggers an alert, and Lin can spin up an additional replica on demand.
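The alerting logic behind that latency metric can be illustrated with a stdlib-only sketch: a rolling window of latency samples whose 95th percentile is compared against a threshold. This stands in for a Prometheus histogram plus an alerting rule; the window size and threshold below are arbitrary choices.

```python
# Rolling p95 latency monitor — an illustrative stand-in for a Prometheus
# histogram plus an alerting rule; window and threshold are arbitrary.
from collections import deque

class LatencyMonitor:
    def __init__(self, window: int = 1000, p95_threshold_ms: float = 50.0):
        self.samples = deque(maxlen=window)   # keep only the recent window
        self.threshold = p95_threshold_ms

    def observe(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95(self) -> float:
        ordered = sorted(self.samples)
        return ordered[int(0.95 * (len(ordered) - 1))]

    def should_alert(self) -> bool:
        return bool(self.samples) and self.p95() > self.threshold

monitor = LatencyMonitor(window=100, p95_threshold_ms=50.0)
for ms in [10] * 90 + [120] * 10:      # a burst of slow predictions
    monitor.observe(ms)
print(monitor.p95(), monitor.should_alert())  # → 120 True
```

In production the same signal would come from a PromQL query over a latency histogram, with PagerDuty wired to the firing rule.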
---
### 4.9 Ethical Governance & Data Privacy
With GDPR and CCPA in mind, Lin ensures:
1. **Data minimisation** – only the fields required for a model are stored.
2. **Pseudonymisation** – customer IDs are hashed.
3. **Explainability** – SHAP values are surfaced to the product team.
4. **Audit trails** – every model version and its deployment history is recorded.
Ethical compliance is baked into the pipeline, not an afterthought.
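The pseudonymisation step above can be as simple as a keyed hash of the customer ID. A minimal sketch follows; the salt value is illustrative, and in practice it would live in a secrets manager rather than in source code:

```python
# Salted hashing of customer IDs — raw IDs never reach the feature store.
# The SALT value is illustrative; keep real salts in a secrets manager.
import hashlib
import hmac

SALT = b"rotate-me-regularly"

def pseudonymise(customer_id: str) -> str:
    # HMAC rather than plain sha256(salt + id) avoids length-extension issues
    return hmac.new(SALT, customer_id.encode(), hashlib.sha256).hexdigest()

token = pseudonymise("customer-42")
assert token != "customer-42"                  # irreversible stand-in for the raw ID
assert pseudonymise("customer-42") == token    # deterministic, so joins still work
```

Because the mapping is deterministic for a given salt, pseudonymised IDs can still be used as join keys across tables without ever storing the raw identifier.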
---
### 4.10 Takeaway
- A scalable pipeline is a **composition of small, well‑tested services**.
- **Version control** (for code, data, and models) is non‑negotiable for reproducibility.
- **Feature stores** unify training and serving data, eliminating training-serving skew.
- **Continuous delivery** ensures that validated models reach production quickly.
- **Monitoring** is the lifeline that keeps the system healthy and trustworthy.
- **Ethical governance** must be integrated at every stage, from ingestion to serving.
In the next chapter, we will focus on how to translate these robust pipelines into **actionable insights** that can be embedded directly into business dashboards and decision‑making workflows.