
Data Science Unlocked: A Practical Guide for Modern Analysts - Chapter 8

Chapter 8: From Models to Production – Deploying, Orchestrating, and Monitoring

Published 2026-02-23 17:39

# Chapter 8

## From Models to Production – Deploying, Orchestrating, and Monitoring

In the previous chapter we trained a lightweight CNN, exposed it through a REST endpoint, and audited its fairness. The next logical step is to embed that service in a robust production stack that can ingest streaming data, schedule periodic retraining, and alert analysts when the model drifts. This chapter walks through the practicalities of turning a research‑grade model into a production‑ready microservice: tying it into a Spark pipeline, orchestrating it with Airflow, and visualizing metrics with Prometheus + Grafana.

---

### 1. Model Packaging & Registry

1. **Export the model** in a portable format.

   ```bash
   python -c "
   import joblib
   from tensorflow.keras.models import load_model

   # joblib keeps this example simple; for Keras models, the native
   # SavedModel/.keras format is the more portable choice.
   model = load_model('model.h5')
   joblib.dump(model, 'model.joblib')
   "
   ```

2. **Register** the artifact in a model registry (MLflow, SageMaker Model Registry, or a simple S3 bucket). Versioning keeps the audit trail.

   ```bash
   # Assumes the model has already been logged to MLflow;
   # `mlflow models serve` takes a model URI, not a bare file path.
   mlflow models serve -m runs:/<run_id>/model -p 8001
   ```

> *Why a registry?* It decouples training from serving, allowing multiple teams to reuse the same model while tracking lineage.

### 2. Containerizing the Service

FastAPI + Uvicorn is a minimal, high‑performance choice for Python services. Wrap the inference logic in a Docker container.

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.joblib .
COPY main.py .
EXPOSE 80
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
```

`main.py`:

```python
from fastapi import FastAPI
import joblib
import numpy as np

app = FastAPI()
model = joblib.load('model.joblib')  # loaded once at startup

@app.post('/predict')
async def predict(payload: dict):
    # Expect {"features": [...]}; reshape into a single-row batch.
    features = np.array(payload['features']).reshape(1, -1)
    prediction = model.predict(features).tolist()
    return {'prediction': prediction}
```

Build the image:

```bash
docker build -t mlservice:latest .
```
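Before tagging and pushing, it is worth running the image locally (e.g. `docker run -p 8000:80 mlservice:latest`) and posting a sample payload at it. A minimal smoke-test sketch using only the standard library; the file name, port mapping, and feature vector are illustrative, not part of the pipeline above:

```python
# smoke_test.py -- hypothetical pre-push check against a locally running container.
import json
from urllib import request

def predict(base_url: str, features: list) -> dict:
    """POST a feature vector to the service's /predict route and parse the reply."""
    req = request.Request(
        f"{base_url}/predict",
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())

# Example, assuming the container was started with `docker run -p 8000:80 ...`:
# predict("http://localhost:8000", [0.1, 0.2, 0.3, 0.4])
```

A check like this catches packaging mistakes (missing model file, wrong `CMD`) before a broken image reaches the registry.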
Tag and push to the registry:

```bash
docker tag mlservice:latest registry.company.com/mlservice:latest
docker push registry.company.com/mlservice:latest
```

---

### 3. Orchestrating with Airflow

Airflow provides a DAG‑based scheduler that can trigger data pipelines, model retraining, and deployment steps.

```python
# dags/feature_ingest_and_retrain.py
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'feature_ingest_and_retrain',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    ingest = SparkSubmitOperator(
        task_id='ingest',
        application='s3://data-pipeline/ingest.py',
        name='feature_ingest',
    )
    retrain = BashOperator(
        task_id='retrain',
        bash_command='python train.py --config s3://configs/model.yaml',
    )
    deploy = BashOperator(
        task_id='deploy',
        bash_command='kubectl set image deployment/mlservice-deploy mlservice=registry.company.com/mlservice:latest',
    )

    ingest >> retrain >> deploy
```

Key points:

- **Idempotent tasks**: each Airflow task should be repeatable without side effects.
- **Trigger rules**: the default `all_success` rule ensures downstream steps only run if the preceding steps succeeded.
- **Observability**: the Airflow UI shows task logs and DAG lineage.

---

### 4. Scaling with Kubernetes

Deploy the FastAPI container into a Kubernetes cluster.
```yaml
# mlservice-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlservice-deploy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mlservice
  template:
    metadata:
      labels:
        app: mlservice
    spec:
      containers:
        - name: mlservice
          image: registry.company.com/mlservice:latest
          ports:
            - containerPort: 80
```

Expose via a LoadBalancer:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mlservice-service
spec:
  type: LoadBalancer
  selector:
    app: mlservice
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
```

Kubernetes autoscaling (HPA) can keep latency low as traffic spikes:

```bash
kubectl autoscale deployment mlservice-deploy --cpu-percent=80 --min=2 --max=10
```

---

### 5. Observability with Prometheus & Grafana

Instrument the FastAPI app with metrics.

```python
# main.py (additions)
from prometheus_client import Summary, Counter, make_asgi_app
from starlette.middleware import Middleware
from starlette.middleware.base import BaseHTTPMiddleware

REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')
REQUEST_COUNT = Counter('request_count', 'Number of requests')

class MetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        REQUEST_COUNT.inc()
        with REQUEST_TIME.time():
            response = await call_next(request)
        return response

# Replaces the bare `app = FastAPI()` from Section 2 so the middleware is attached.
app = FastAPI(middleware=[Middleware(MetricsMiddleware)])
app.mount('/metrics', make_asgi_app())
```

Prometheus scrape config (`prometheus.yml`):

```yaml
scrape_configs:
  - job_name: 'mlservice'
    static_configs:
      - targets: ['mlservice-service:80']
```

Grafana dashboards:

- **Latency** (from the `request_processing_seconds` summary)
- **Throughput** (`request_count`)
- **Error rate** (HTTP 4xx/5xx responses)
- **CPU & memory** (Kubernetes metrics)

---

### 6. Governance & Compliance

- **Model versioning**: keep a signed JSON metadata file alongside the model containing hash, training date, feature set, and target.
- **Bias monitoring**: deploy a sidecar that compares predictions against demographic labels, updating a Prometheus gauge (`bias_metric`).
- **Security**: use mutual TLS between services, enforce API keys, and enable role‑based access control in Airflow.
- **Data retention**: archive feature store snapshots to cold storage after a retention period; delete old Airflow logs.

---

### 7. Continuous Feedback Loop

1. **Retrain trigger**: an Airflow DAG monitors validation metrics; if `validation_accuracy` drops below 0.85, it schedules a retraining DAG.
2. **A/B testing**: spin up a parallel deployment (`mlservice-ab`) and route 10% of traffic to it. Compare metrics via Prometheus.
3. **Canary release**: gradually shift traffic to a new model version while monitoring latency and error rates.

---

### 8. Putting It All Together – A Mini‑Case Study

> **Scenario:** A fintech company predicts loan default risk.
>
> **Data pipeline:** Spark pulls credit history from a data lake, cleans it, and writes a 2‑D image‑like feature matrix to S3.
>
> **Model:** A CNN trained on these matrices.
>
> **Serving:** FastAPI container deployed to EKS, exposed via an ALB.
>
> **Orchestration:** An Airflow DAG ingests daily updates, retrains weekly, and rolls out new containers.
>
> **Observability:** Prometheus scrapes metrics; Grafana alerts on 5xx spikes.
>
> **Governance:** Every model version is stored in an S3 bucket with a signed manifest; bias metrics are visualized.
>
> **Result:** The production pipeline achieved 92% accuracy, reduced latency to < 200 ms, and allowed data scientists to iterate in under 48 hours.

---

### 9. Take‑aways

- **Decouple training and serving**: a registry and containerization make swapping versions painless.
- **Infrastructure as code**: use Kubernetes YAML, Helm charts, and Airflow DAGs for reproducibility.
- **Observability is non‑negotiable**: metrics, logs, and alerts are the eyes that catch problems before users notice.
- **Governance and compliance**: model lineage, bias monitoring, and security safeguards protect the organization.
- **Continuous learning**: automating retraining and deployment creates a feedback loop that keeps the model relevant.

In the next chapter we will dive into advanced scaling strategies—model serving at the edge, multi‑region deployments, and serverless inference—so that you can tackle the most demanding real‑world scenarios.