Data Science for the Modern Analyst: From Data to Insight - Chapter 6
Published 2026-03-04 16:15
# Chapter 6: Deployment & Production
> **Take‑away** – Turning a trained model into a reliable, scalable, and maintainable production asset requires more than just code. It demands disciplined practices around containerization, continuous integration, API design, monitoring, and cloud‑native scalability.
## 6.1 Introduction
After mastering data pipelines, statistical modeling, and machine‑learning workflows, the final leg of a data scientist’s journey is deploying models into the production environment where business users can rely on them.
* **Why production matters** – A model that performs well in a notebook but crashes on a web request creates mistrust and operational risk.
* **Key pillars** –
1. **Containerization** – Encapsulate the model, dependencies, and runtime.
2. **CI/CD** – Automate testing, building, and deployment.
3. **API Design** – Expose inference as a stateless RESTful service.
4. **Observability** – Monitor latency, error rates, and model drift.
5. **Scaling** – Auto‑scale based on demand and resource constraints.
We build on the *Container Instances* chapter, extending those concepts to end‑to‑end pipelines and cloud‑native orchestration.
## 6.2 Containerization with Docker
Containers provide a reproducible runtime environment that isolates your model from host OS variations. For ML models, multi‑stage Docker builds help keep images lean.
### 6.2.1 Writing a Dockerfile
```dockerfile
# Stage 1 – Build the model
FROM python:3.10-slim AS builder
WORKDIR /app

# Install build-time dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source and train the model
COPY . .
RUN python train.py  # Produces model.pkl

# Stage 2 – Runtime image
FROM python:3.10-slim
WORKDIR /app

# Install runtime dependencies only (the API below uses FastAPI)
RUN pip install --no-cache-dir fastapi uvicorn

# Copy model artifacts from the builder stage
COPY --from=builder /app/model.pkl .
COPY --from=builder /app/api.py .

EXPOSE 8080
CMD ["python", "api.py"]
```
*Use `--no-cache-dir` to reduce image size, and separate stages to avoid bundling training dependencies.*
### 6.2.2 Best Practices
| Practice | Rationale |
|---|---|
| **Pin versions** – `python:3.10.12` and explicit `pip install fastapi==0.110.0` | Reproducibility across environments |
| **Minimize layers** – Combine `RUN` commands | Smaller image size |
| **Add `.dockerignore`** – Exclude `__pycache__`, `data/`, `tests/` | Faster builds |
| **Use `HEALTHCHECK`** – Verify API is up | Self‑diagnostics |
## 6.3 CI/CD for Machine‑Learning Models
Continuous Integration and Continuous Deployment pipelines automate the entire journey from code commit to live inference service.
### 6.3.1 Typical Pipeline Stages
| Stage | What Happens | Tools |
|---|---|---|
| **Lint** | Code style check (`flake8`, `black`) | GitHub Actions |
| **Test** | Unit and integration tests | `pytest` |
| **Build** | Docker image creation | Docker CLI |
| **Push** | Push image to registry | Docker Hub, GitHub Container Registry, ECR |
| **Deploy** | Spin up container in target environment | Kubernetes, ECS, GKE |
| **Smoke Test** | Verify endpoint responds | `curl` or `httpie` |
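The smoke‑test stage can also be scripted instead of relying on an ad‑hoc `curl`. A minimal stdlib sketch is shown below; the `/health` endpoint name is an assumption and should match whatever your service actually exposes:

```python
"""Smoke-test a deployed inference endpoint using only the standard library."""
import urllib.error
import urllib.request


def smoke_test(url: str, timeout: float = 5.0) -> bool:
    """Return True iff the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout all count as failures
        return False
```

In the CI job, exit non‑zero when the check fails (e.g., `sys.exit(0 if smoke_test(url) else 1)`) so the pipeline stops before routing traffic to a broken deployment.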
### 6.3.2 Example GitHub Actions Workflow
```yaml
name: ML Deploy

on:
  push:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest tests/
      - name: Build Docker image
        run: docker build -t myorg/ml-api:${{ github.sha }} .
      - name: Log in to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USER }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Push image
        run: docker push myorg/ml-api:${{ github.sha }}
      - name: Deploy to Kubernetes
        uses: azure/k8s-deploy@v2
        with:
          manifests: k8s/deployment.yaml
          images: myorg/ml-api:${{ github.sha }}
```
### 6.3.3 Model & Data Versioning
| Tool | Purpose |
|---|---|
| **DVC** | Track datasets, model artifacts, and pipelines |
| **MLflow** | Model registry, experiment tracking |
| **Git LFS** | Store large binary files in git |
## 6.4 RESTful API Design for Inference
Expose your model through a stateless HTTP API. FastAPI is a modern, high‑performance choice.
### 6.4.1 FastAPI Skeleton
```python
# api.py
import pickle

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Ticket Frequency Predictor")

# Load the model once at startup, not per request
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class TicketRequest(BaseModel):
    user_id: int
    time_of_day: str  # e.g., "14:30"
    device: str

class TicketResponse(BaseModel):
    predicted_frequency: float

@app.post("/predict", response_model=TicketResponse)
async def predict(request: TicketRequest):
    try:
        # NOTE: string features such as time_of_day and device must be
        # encoded exactly as they were during training before inference.
        features = [request.user_id, request.time_of_day, request.device]
        prob = model.predict_proba([features])[0][1]
        return TicketResponse(predicted_frequency=prob)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
```
### 6.4.2 Design Principles
| Principle | Why It Matters |
|---|---|
| **Statelessness** | Enables horizontal scaling |
| **Schema validation** | Prevents malformed requests |
| **Versioned endpoints** (`/v1/predict`) | Allows backward compatibility |
| **Rate limiting** | Protects against DoS |
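Rate limiting is usually delegated to an API gateway, but the underlying idea is simple. A minimal in‑process token‑bucket sketch (class and parameter names are illustrative, not from any particular library):

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A rejected call would map to an HTTP `429 Too Many Requests` response in the API layer.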
## 6.5 Monitoring, Logging, and Observability
A production model is only useful if you can see how it behaves over time.
### 6.5.1 Metrics Collection
| Metric | Typical Unit | Alerting |
|---|---|---|
| Latency | ms | > 500 ms for 95th percentile |
| Error rate | % | > 1 % |
| CPU usage | % | > 80 % |
| Memory usage | % | > 70 % |
| Model drift (mean shift in predicted `Ticket Frequency`) | – | > 15 % shift |
Use **Prometheus** to scrape metrics from `/metrics` endpoint exposed by FastAPI (via `prometheus_fastapi_instrumentator`).
```python
from prometheus_fastapi_instrumentator import Instrumentator

instrumentator = Instrumentator()
instrumentator.instrument(app).expose(app, should_gzip=True)
```
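The drift threshold in the metrics table can be checked with a plain mean‑shift comparison between a reference window (e.g., training data) and recent predictions. A stdlib sketch; the 15 % default follows the table above and the function names are illustrative:

```python
def mean_shift_pct(reference: list[float], recent: list[float]) -> float:
    """Relative shift (in %) of the recent mean versus the reference mean."""
    ref_mean = sum(reference) / len(reference)
    new_mean = sum(recent) / len(recent)
    return abs(new_mean - ref_mean) / abs(ref_mean) * 100.0


def drift_alert(reference: list[float], recent: list[float],
                threshold_pct: float = 15.0) -> bool:
    """True when the recent window has drifted beyond the threshold."""
    return mean_shift_pct(reference, recent) > threshold_pct
```

In practice this runs as a scheduled job over a sliding window of logged predictions, with the result exported as a Prometheus gauge so the alerting rules below can act on it.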
### 6.5.2 Logging
* **Structured logs** – JSON format with `timestamp`, `level`, `request_id`, `endpoint`.
* **Log rotation** – Use `logrotate` or container‑side logging drivers.
* **Centralized aggregation** – ELK stack (Elasticsearch, Logstash, Kibana) or **AWS CloudWatch Logs**.
```python
import logging

from fastapi import Request

logger = logging.getLogger("ml_api")
logger.setLevel(logging.INFO)

@app.middleware("http")
async def log_requests(request: Request, call_next):
    logger.info({"method": request.method, "url": str(request.url)})
    response = await call_next(request)
    return response
```
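The structured‑log bullet above can be implemented with a small stdlib formatter that emits one JSON object per record; the exact field names are an assumption:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
json_logger = logging.getLogger("ml_api")
json_logger.addHandler(handler)
json_logger.setLevel(logging.INFO)
```

JSON‑per‑line output lets Logstash or CloudWatch parse fields without fragile regexes.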
### 6.5.3 Alerting
Integrate **Alertmanager** with Prometheus. Define alert rules for latency spikes, error rates, and drift detection. Push alerts to Slack, PagerDuty, or Teams.
```yaml
- alert: HighLatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "95th percentile latency > 500 ms"
```
## 6.6 Performance Scaling on Cloud Platforms
### 6.6.1 Container Orchestration
| Platform | Key Features |
|---|---|
| **Kubernetes** (AKS, EKS, GKE) | Autoscaling, self‑healing, custom resource definitions |
| **Azure Container Instances** | Serverless containers, no VM management |
| **AWS Fargate** | Managed compute for ECS/EKS |
| **Google Cloud Run** | Fully managed, HTTP‑based scaling |
### 6.6.2 Horizontal Pod Autoscaler (HPA)
Configure HPA to spin up additional pods based on CPU or custom metrics (e.g., request latency).
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
### 6.6.3 Serverless Inference
For bursty workloads, consider **AWS Lambda** (packaging dependencies as Lambda Layers) or **Google Cloud Functions**. Keep the handler and its imports lightweight so cold‑start latency stays low.
### 6.6.4 Edge Deployment
When latency budgets are tight, deploy the model to edge devices using tools like **TensorRT**, **ONNX Runtime**, or **OpenVINO**. Containers can run on NVIDIA Jetson or Raspberry Pi with `docker run`.
## 6.7 Security & Compliance
| Area | Recommendation |
|---|---|
| **Secrets** | Use Azure Key Vault, AWS Secrets Manager, or HashiCorp Vault |
| **API keys** | Enforce API key rotation, scope, and rate limits |
| **Network policies** | Isolate services, restrict ingress/egress |
| **Data encryption** | TLS for transit, AES‑256 for at‑rest data |
| **Audit trails** | Log every request and deployment event |
## 6.8 Reproducibility & Versioning in Production
| Practice | Tool | How It Helps |
|---|---|---|
| **Data Versioning** | DVC | Track data changes, enable rollbacks |
| **Model Registry** | MLflow | Tag, version, and deploy specific model artifacts |
| **Configuration Management** | `json`, `yaml`, or `python` dicts | Keep environment, hyperparameters, and thresholds in code |
| **Artifact Store** | S3, GCS, Azure Blob | Centralize binaries, logs, and metrics |
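The configuration‑management row above can be as simple as a JSON file merged over code‑level defaults at startup. A stdlib sketch with hypothetical keys:

```python
import json
from pathlib import Path

# Code-level defaults; any key can be overridden by config.json
DEFAULTS = {
    "model_path": "model.pkl",
    "latency_alert_ms": 500,
    "drift_threshold_pct": 15.0,
}


def load_config(path: str = "config.json") -> dict:
    """Merge an optional JSON config file over the defaults."""
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        cfg.update(json.loads(p.read_text()))
    return cfg
```

Keeping thresholds in one versioned file means a deploy, not a code change, is enough to retune alerting.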
## 6.9 Checklist Before Going Live
| Item | Done |
|---|---|
| **Unit & integration tests** | ☐ |
| **Performance benchmarks** | ☐ |
| **Security audit** | ☐ |
| **Monitoring & alerting set up** | ☐ |
| **Canary deployment strategy** | ☐ |
| **Rollback plan** | ☐ |
## 6.10 Summary
Deploying a machine‑learning model is a multidisciplinary exercise that blends software engineering best practices with data‑science rigor. By following a structured CI/CD pipeline, containerizing your inference service, exposing it through a clean RESTful API, and embedding observability, you turn predictive models into scalable, trustworthy business assets.
**Next Chapter Preview** – *Model Monitoring & Continuous Learning*: Discover how to keep your model’s performance in check as real‑world data drifts, and automate iterative improvements without manual re‑training.