Data Science Unveiled: From Raw Data to Insightful Decisions - Chapter 9
Chapter 9 – From Prototype to Production: Deploying Data‑Science Models at Scale
Published 2026-03-06 22:26
# 9.1 From Notebook to Service
In the last chapter we learned to orchestrate experiments as *first‑class artifacts*—data, code, hyper‑parameters, logs—all versioned in a Git‑style repository. The next logical step is to expose the *best* of those experiments to the world: a model that runs, scales, and serves predictions on demand.
## 9.1.1 Why Deployment Matters
> **Deploying a model is not a one‑off task; it is a continuous journey.** A model that once delivered 95 % accuracy in a sandbox may degrade to 80 % once the production data distribution shifts. Thus, deployment is a gateway to **monitoring**, **re‑training**, and **governance**.
### 1. Docker: The Packaging Unit
Docker provides a declarative way to capture all the runtime dependencies of your model.
```dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=8080
EXPOSE 8080
# FastAPI is an ASGI app, so gunicorn needs the uvicorn worker class
ENTRYPOINT ["gunicorn", "app:app", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8080"]
```
- **Benefits**: reproducibility, isolation, ease of CI/CD.
- **Pitfalls**: image bloat; watch the size of the base image.
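One common remedy for image bloat is a multi-stage build: dependencies are installed in a throwaway build stage, and only the resulting packages are copied into the final image. The sketch below assumes the same layout as the Dockerfile above:

```dockerfile
# Build stage: install dependencies into an isolated prefix
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Runtime stage: copy only the installed packages and the app code
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
ENV PORT=8080
EXPOSE 8080
ENTRYPOINT ["gunicorn", "app:app", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8080"]
```

Because build tools and pip caches never reach the runtime stage, the final image typically shrinks noticeably.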
### 2. Building a Predictive Service
```python
# app.py
from fastapi import FastAPI, Request
import joblib
import numpy as np
app = FastAPI()
model = joblib.load("model.pkl")
@app.post("/predict")
async def predict(request: Request):
    payload = await request.json()
    # Reshape the flat feature list into a single-row matrix
    X = np.array(payload["features"]).reshape(1, -1)
    prob = model.predict_proba(X)[0, 1]
    # Cast the NumPy scalar to a native float so it serializes to JSON
    return {"probability": float(prob)}
```
FastAPI gives you async request handling, auto‑generated OpenAPI docs, and a minimal learning curve.
## 9.1.2 Scaling with Kubernetes
Running containers locally is fine for experimentation, but production demands **horizontal scaling**, **self‑healing**, and **observability**.
### 1. Helm Charts
Using Helm lets you package all Kubernetes manifests into reusable charts.
```yaml
# helm/myapp/values.yaml
replicaCount: 3
image:
  repository: myrepo/myapp
  tag: "1.0.0"   # values files are not rendered as templates; pin a concrete tag
service:
  type: ClusterIP
  port: 80
resources:
  limits:
    cpu: "1"
    memory: "512Mi"
```
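These values are consumed by the chart's templates, where Go templating *is* rendered. A minimal deployment template sketch (file path, labels, and container port are illustrative):

```yaml
# helm/myapp/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-myapp
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 8080
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
```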
### 2. Autoscaling
Leverage **HorizontalPodAutoscaler** (HPA) to scale pods based on CPU or custom metrics (e.g., request latency). Example:
```yaml
apiVersion: autoscaling/v2   # v2beta2 is deprecated; v2 is stable since Kubernetes 1.23
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```
## 9.1.3 Observability & Monitoring
- **Prometheus + Grafana**: scrape metrics from a `/metrics` endpoint exposed by the FastAPI service.
- **ELK Stack**: log routing and anomaly detection.
- **OpenTelemetry**: distributed tracing across microservices.
```yaml
# OpenTelemetry Collector sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-opentelemetry
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector:latest
          args: ["--config=/conf/otel-collector-config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-collector-config   # ConfigMap holding the collector config
```
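On the application side, the Prometheus bullet above assumes the service exports metrics. A minimal sketch using the `prometheus_client` library (the metric names and the idea of a dedicated `/metrics` endpoint are illustrative choices, not a fixed convention):

```python
# Expose request metrics in the Prometheus text format.
# Metric names here are illustrative, not a required convention.
from prometheus_client import Counter, Histogram, generate_latest

PREDICTIONS = Counter("predictions_total", "Total prediction requests served")
LATENCY = Histogram("prediction_latency_seconds", "Prediction request latency")

def record_prediction(duration_seconds: float) -> None:
    """Update the metrics after each /predict call."""
    PREDICTIONS.inc()
    LATENCY.observe(duration_seconds)

def metrics_payload() -> bytes:
    """Body for a /metrics endpoint, in the Prometheus exposition format."""
    return generate_latest()

record_prediction(0.042)
```

Wiring `metrics_payload()` to a `GET /metrics` route gives Prometheus a scrape target, and the histogram feeds latency-based autoscaling later.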
## 9.1.4 Model Governance in Production
Once your model is live, governance becomes an operational necessity:
1. **Versioning**: Tag each model with a semantic version and store its artifacts in a model registry (e.g., MLflow).
2. **A/B Testing**: Serve traffic to a new model version to a subset of users and compare metrics.
3. **Feature Drift Monitoring**: Compare the distribution of incoming features against the training data.
4. **Explainability Dashboards**: Deploy SHAP or LIME visualizations for end‑users to interrogate predictions.
## 9.1.5 Continuous Delivery Pipeline
1. **Git Push** → **CI Build** (lint, unit tests, integration tests).
2. **Docker Build** → **Push to Registry**.
3. **Helm Upgrade** → **Deploy to K8s**.
4. **Post‑Deployment Smoke Test** → **Monitor**.
5. **Model Retraining Trigger** when drift exceeds threshold.
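The steps above map naturally onto a CI workflow. A hedged GitHub Actions sketch; the job names, image repository, and chart path are assumptions, not a prescribed layout:

```yaml
# .github/workflows/deploy.yml (illustrative)
name: build-and-deploy
on:
  push:
    branches: [main]
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint and test
        run: |
          pip install -r requirements.txt
          pytest
      - name: Build and push image
        run: |
          docker build -t myrepo/myapp:${{ github.sha }} .
          docker push myrepo/myapp:${{ github.sha }}
      - name: Deploy with Helm
        run: |
          helm upgrade --install myapp ./helm/myapp \
            --set image.tag=${{ github.sha }}
```

Tagging images with the commit SHA keeps every deployment traceable back to the exact code and model version that produced it.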
## 9.1.6 Lessons Learned
| Observation | Takeaway |
|--------------|----------|
| Cold starts on GKE averaged 4 seconds. | Consider using a *Knative* eventing model or pre‑warm pods. |
| GPU utilization for inference dropped to 20 % in production. | Tune the batch size and consider ONNX Runtime. |
| Drift detection triggered 12 re‑trains in 3 months. | Invest in automated data pipelines to keep the training set fresh. |
# 9.2 The Big Picture
Deploying a model is a *continuous loop*—not a checkpoint. The code you write in the notebook is only the first iteration. Production is an ecosystem where data, code, monitoring, and governance interact fluidly. In the next chapter we’ll formalize this ecosystem with **Kubeflow Pipelines** and **MLOps best practices**, turning that loop into a scalable, maintainable pipeline.