
Data Science for the Modern Analyst: From Concepts to Implementation - Chapter 4

Chapter 4: Building Scalable Machine‑Learning Pipelines

Published 2026-02-26 06:13

# Chapter 4

## Building Scalable Machine‑Learning Pipelines

When Lin stared at the flood of click‑stream data spilling from the company’s internal log system, she realised that raw numbers alone were useless without a well‑architected pipeline to transform them into insights. This chapter walks through the principles, patterns, and tools that let us turn messy data into reliable, repeatable models that can scale from a handful of customers to millions.

---

### 4.1 The Blueprint

A pipeline is a **chain of tasks** that starts with data ingestion and ends with a live model‑serving endpoint. It is built to:

1. **Handle volume and velocity** – ingest data in batch or stream.
2. **Guarantee quality** – enforce validation, deduplication, and anomaly detection.
3. **Maintain reproducibility** – record every transformation and model state.
4. **Support evolution** – let new features, models, or business rules be added with minimal friction.
5. **Operate in production** – monitor latency, accuracy, and resource usage.

Lin’s goal is to keep this chain **modular**. A broken link in one module should not bring down the whole system. Think of it as a set of Lego blocks that can be swapped, upgraded, or patched independently.
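The Lego-block idea can be sketched as a minimal task chain. Everything here (stage names, record shape) is illustrative, not Lin’s actual code:

```python
from typing import Callable, List

Stage = Callable[[list], list]

def run_pipeline(stages: List[Stage], records: list) -> list:
    """Run each stage in order; any stage can be swapped without touching the rest."""
    for stage in stages:
        records = stage(records)
    return records

def validate(records: list) -> list:
    # Drop malformed records instead of failing the whole run
    return [r for r in records if "id" in r and "value" in r]

def deduplicate(records: list) -> list:
    seen, out = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

clean = run_pipeline(
    [validate, deduplicate],
    [{"id": 1, "value": 3}, {"id": 1, "value": 3}, {"value": 9}],
)
# validation drops the malformed record, deduplication drops the repeat
```

Because each stage shares one interface, upgrading (say) deduplication never requires touching validation — the property the rest of this chapter scales up to micro-services.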
---

### 4.2 Modular Design with Micro‑Services

Rather than a monolith, Lin adopts a **micro‑service** architecture where each stage of the pipeline is a containerized process:

| Stage | Responsibility | Typical Tech Stack |
|-------|----------------|--------------------|
| Ingestion | Pull from Kafka / S3 | Apache Flink / Spark Structured Streaming |
| ETL | Clean, enrich, store | Airflow DAGs, dbt models |
| Feature Store | Serve static and streaming features | Feast, Redshift / BigQuery |
| Model Training | Train & tune models | scikit‑learn, XGBoost, TensorFlow |
| Model Registry | Version, test, promote | MLflow, Weights & Biases |
| Serving | Predict in real‑time | TensorFlow Serving, TorchServe |
| Monitoring | Metrics, alerts | Prometheus, Grafana |

This separation allows Lin to iterate on feature engineering without waiting for the training job to finish, or to roll back a model without disrupting the ingestion layer.

---

### 4.3 Data Ingestion: Batch & Stream

Lin faces two types of data: **historical sales** stored in nightly CSV dumps and **real‑time click events** arriving on Kafka.
Her ingestion strategy is:

```python
# batch_ingest.py (Airflow DAG)
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG('batch_ingest',
         start_date=datetime(2024, 1, 1),
         schedule_interval='@daily') as dag:
    download = BashOperator(
        task_id='download_csv',
        # `aws s3 cp` copies by pattern only with --recursive plus include/exclude filters
        bash_command='aws s3 cp s3://sales/2024/ ./data/ --recursive --exclude "*" --include "*.csv"'
    )
    transform = BashOperator(
        task_id='transform',
        bash_command='python scripts/transform_batch.py'
    )
    download >> transform
```

```python
# stream_ingest.py (Flink job)
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource

env = StreamExecutionEnvironment.get_execution_environment()

source = (KafkaSource.builder()
          .set_bootstrap_servers("kafka:9092")
          .set_topics("clicks")
          .set_value_only_deserializer(SimpleStringSchema())
          .build())

clicks = env.from_source(source, WatermarkStrategy.no_watermarks(), "clicks")
clicks.sink_to(...)  # KafkaSink to 'cleaned_clicks'; cleaning transform and sink config elided
env.execute("clean_clicks")
```

By decoupling ingestion from downstream transforms, Lin can scale each component independently.

---

### 4.4 Feature Engineering & Feature Store

Lin’s models need both **static** attributes (customer age, gender) and **dynamic** aggregates (last 7‑day purchase frequency). She stores features in a **Feast** feature store so that training and serving consume the same data.

```python
# feast_feature.py
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Int64, String

customer = Entity(name="customer", join_keys=["customer_id"], description="Customer ID")

customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    ttl=timedelta(days=1),
    schema=[
        Field(name="age", dtype=Int64),
        Field(name="gender", dtype=String),
        Field(name="recent_purchase_cnt", dtype=Int64),  # purchases in last 7 days
    ],
    source=FileSource(path="data/customer_features.parquet", timestamp_field="event_ts"),
)
```

Now the training script simply pulls features by key, eliminating manual joins.
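Before a dynamic aggregate like `recent_purchase_cnt` can be materialised into the store, something has to compute it. A stdlib-only sketch of the trailing-window count (the field names and window are assumptions consistent with the feature above):

```python
from datetime import datetime, timedelta

def recent_purchase_cnt(purchase_times, now, window_days=7):
    """Count purchases whose timestamp falls inside the trailing window."""
    cutoff = now - timedelta(days=window_days)
    return sum(1 for ts in purchase_times if cutoff < ts <= now)

now = datetime(2024, 6, 15)
history = [datetime(2024, 6, 14), datetime(2024, 6, 10), datetime(2024, 5, 1)]
cnt = recent_purchase_cnt(history, now)  # only the two June purchases fall in the window
```

In production this computation would run in the stream or batch layer and be written to the offline/online stores, so training and serving see identical values.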
---

### 4.5 Model Training: From Prototyping to Production

Lin starts with a quick Jupyter notebook, experimenting with a gradient‑boosted tree. She then moves the best model into a reproducible training pipeline using **MLflow**:

```python
# train.py
import mlflow
import mlflow.sklearn
from xgboost import XGBClassifier  # XGBClassifier lives in xgboost, not sklearn
from sklearn.metrics import roc_auc_score

mlflow.set_experiment("promo_response")

with mlflow.start_run():
    X_train, X_test, y_train, y_test = load_data()  # project-specific helper
    model = XGBClassifier(n_estimators=300, learning_rate=0.05)
    model.fit(X_train, y_train)

    pred = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, pred)

    mlflow.log_metric("auc", auc)
    mlflow.sklearn.log_model(model, "model")  # works because XGBClassifier follows the sklearn API
```

The MLflow run is automatically stored in a central registry, capturing hyperparameters, code versions, and artifacts. Lin can later use **mlflow models** to export the model in a format deployable to any serving platform.

---

### 4.6 Model Validation & Governance

Validation is more than a test split. Lin uses a **validation pipeline** that:

1. **Runs A/B tests** on a fraction of traffic.
2. **Monitors drift** with the *Alibi Detect* library.
3. **Tracks data lineage** via *Great Expectations*.
4. **Audits** every request with *OpenTelemetry*.

```python
# drift_detection.py
import numpy as np
from alibi_detect.cd import MMDDrift

# Reference sample drawn from the training-time feature distribution
drift_detector = MMDDrift(x_ref=np.random.randn(1000, 10), p_val=0.05)

# In real time
new_batch = get_live_features()  # project-specific helper returning an (n, 10) array
preds = drift_detector.predict(new_batch)
if preds["data"]["is_drift"]:
    alert("Drift detected")
```

These checks ensure that Lin’s models remain trustworthy as the data ecosystem evolves.

---

### 4.7 Model Serving & Continuous Delivery

Deploying a model in a production environment demands low latency and high availability. Lin chooses **TensorFlow Serving** wrapped in a **Kubernetes** pod, orchestrated by **ArgoCD** for Git‑driven continuous delivery.
```yaml
# tf-serving-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: promo-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: promo-model
  template:
    metadata:
      labels:
        app: promo-model
    spec:
      containers:
        - name: serving
          image: tensorflow/serving:2.8.0
          args:
            - "--model_name=promo_response"
            - "--model_base_path=/models/promo_response"
          ports:
            - containerPort: 8500
```

ArgoCD watches the Git repository for new *model.yaml* files and automatically updates the deployment. This approach ensures that the latest validated model reaches production without manual intervention.

---

### 4.8 Monitoring, Logging, and Alerting

Lin integrates **Prometheus** for metrics, **Grafana** for dashboards, and **PagerDuty** for alerts. Key metrics include:

- Prediction latency per request.
- Queue depth for incoming events.
- Model accuracy over time.
- CPU/GPU utilization.

An anomaly in prediction latency triggers an alert, and Lin can spin up an additional replica on demand.

---

### 4.9 Ethical Governance & Data Privacy

With GDPR and CCPA in mind, Lin ensures:

1. **Data minimisation** – only the fields required for a model are stored.
2. **Pseudonymisation** – customer IDs are hashed.
3. **Explainability** – SHAP values are surfaced to the product team.
4. **Audit trails** – every model version and its deployment history is recorded.

Ethical compliance is baked into the pipeline, not an afterthought.

---

### 4.10 Takeaway

- A scalable pipeline is a **composition of small, well‑tested services**.
- **Version control** (for code, data, and models) is non‑negotiable for reproducibility.
- **Feature stores** unify training and serving data, eliminating training–serving skew.
- **Continuous delivery** ensures that validated models reach production quickly.
- **Monitoring** is the lifeline that keeps the system healthy and trustworthy.
- **Ethical governance** must be integrated at every stage, from ingestion to serving.
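The pseudonymisation step from §4.9 can be sketched with a keyed one-way hash, so raw customer IDs never reach the feature store. The salt handling here is illustrative; in production the secret would live in a secret manager, not in source code:

```python
import hmac
import hashlib

SALT = b"rotate-me-via-a-secret-manager"  # assumption: per-environment secret

def pseudonymise(customer_id: str) -> str:
    """Keyed hash: stable across runs (joins still work), not reversible without the salt."""
    return hmac.new(SALT, customer_id.encode(), hashlib.sha256).hexdigest()

token = pseudonymise("C123")  # 64 hex characters, identical for every run with the same salt
```

Using HMAC rather than a bare hash means an attacker who knows the ID space cannot simply re-hash candidate IDs to de-anonymise the table.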
In the next chapter, we will focus on how to translate these robust pipelines into **actionable insights** that can be embedded directly into business dashboards and decision‑making workflows.