Chapter 8: Deployment, Monitoring, and MLOps

Published 2026-02-25 06:44

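This chapter follows a model from packaging to production: API deployment, model monitoring, retraining workflows, and end-to-end MLOps pipeline design. Worked examples and tool notes show how to keep model performance stable in a real environment, iterate quickly, and stay compliant and secure.

## 8.1 Model Packaging

| Step | Purpose | Tools |
|------|---------|-------|
| 1. Save the model | Serialize the trained model | `pickle`, `joblib`, `ONNX`, `TensorFlow SavedModel` |
| 2. Manage dependencies | Keep the runtime environment consistent | `conda env`, `pipenv`, `poetry` |
| 3. Version the model | Track model versions | `MLflow`, `DVC` |

Example code for each step:

```python
import joblib

# Step 1: serialize the trained estimator to disk
joblib.dump(model, "model.joblib")
```

```yaml
# Step 2: conda environment file
name: ml-env
dependencies:
  - python=3.10
  - pandas
  - scikit-learn
```

```python
import mlflow

# Step 3: log the model as an artifact of the active MLflow run
mlflow.sklearn.log_model(model, "model")
```

> **Note**: Model file size and serialization format directly affect deployment latency and resource usage. For deep models, consider exporting to ONNX (served with ONNX Runtime) or compiling with TensorRT for faster inference.

## 8.2 API Deployment

### 8.2.1 Choosing a Web Framework

- **FastAPI**: asynchronous, with built-in OpenAPI support; well suited to rapid prototyping. It launched with Python 3.6+ support, and newer releases require a more recent Python.
- **Flask**: lightweight, with a large community and plenty of resources.

A **FastAPI + uvicorn** example:

```python
# main.py
from fastapi import FastAPI, HTTPException
import joblib
import pandas as pd

app = FastAPI(title="Credit Scoring API")
model = joblib.load("model.joblib")  # load once at startup, not per request


# Plain `def`: FastAPI runs it in a thread pool, so the blocking
# predict_proba call does not stall the event loop.
@app.post("/predict")
def predict(payload: dict):
    try:
        df = pd.DataFrame([payload])  # one row; keys must match the training features
        prob = model.predict_proba(df)[:, 1].tolist()
        return {"score": prob[0]}
    except Exception as e:
        # Malformed input surfaces as a 400 instead of a 500
        raise HTTPException(status_code=400, detail=str(e))
```

### 8.2.2 Containerization (Docker)

```dockerfile
# Dockerfile
FROM python:3.10-slim
WORKDIR /app
# Copy requirements first so the dependency layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### 8.2.3 Deploying to the Cloud

| Platform | Characteristics | Typical use case |
|----------|-----------------|------------------|
| AWS ECS/Fargate | Serverless containers, easy to scale | Large-scale batch inference |
| GKE / EKS | Kubernetes-native, supports autoscaling | Multiple models, mixed workloads |
| Azure Container Apps | Serverless with event triggers | IoT and edge connectivity |

> **Tip**: Use a **Health Check** and a **Readiness Probe** to safeguard service availability.

## 8.3 Monitoring Metrics

| Category | What to monitor | Purpose |
|----------|-----------------|---------|
| Service availability | Request Latency, Error Rate, Uptime | Meet the SLA |
| Data drift | Input Feature Distribution, Population Stability Index (PSI) | Detect changes in the data distribution |
| Model drift | Prediction Drift, Concept Drift Index | Detect declining model performance |
| Resource usage | CPU, Memory, GPU Utilization | Cost management |

### 8.3.1 Monitoring Tools

- **Prometheus + Grafana**: metric collection and visualization.
- **kube-prometheus**: Kubernetes-native monitoring.
- **ELK Stack**: log aggregation and search.
- **MLflow Tracking**: model experiment and deployment logs.

### 8.3.2 Example: Prometheus Exporter

```python
# metrics_exporter.py
from prometheus_client import start_http_server, Summary, Gauge
import time

REQUEST_TIME = Summary("request_processing_seconds", "Time spent processing request")
# The label is named "model" because Prometheus attaches its own
# "instance" label to every scraped target.
MODEL_SCORE = Gauge("model_score", "Model prediction score", ["model"])


@REQUEST_TIME.time()
def process_request(x):
    # Simulated computation
    time.sleep(0.1)
    return 0.9


if __name__ == "__main__":
    start_http_server(8001)  # metrics are served on port 8001
    while True:
        score = process_request(1)
        MODEL_SCORE.labels(model="demo").set(score)
        time.sleep(5)
```

## 8.4 Model Retraining

1. **Data drift detection**: compare PSI daily or weekly; if PSI > 0.1, trigger training.
2. **Automation scripts**: define the workflow as a DAG in Airflow or Prefect.
3. **Model version comparison**: before promoting the retrained model, compare the old and new versions with cross-validation or a hold-out test set.
4. **Canary release**: route 5% of traffic to the new version first, monitor it, then switch over fully.

In the DAG below, the drift check is a `ShortCircuitOperator`, so the retraining task is skipped whenever the check returns `False`. A plain `PythonOperator` would ignore the return value and retrain on every run.

```python
# retrain_dag.py (Airflow)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator


def check_data_drift():
    # Placeholder: read the latest PSI from the monitoring store
    psi = 0.12
    return psi > 0.1  # False skips the downstream retraining task


def retrain():
    # Retrain and upload the new model to MLflow
    pass


with DAG(
    "retrain_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # use `schedule_interval` on Airflow < 2.4
    catchup=False,  # do not backfill every day since start_date
) as dag:
    t_check = ShortCircuitOperator(task_id="check_drift", python_callable=check_data_drift)
    t_retrain = PythonOperator(task_id="retrain_model", python_callable=retrain)
    t_check >> t_retrain
```

## 8.5 MLOps Pipeline Design

```text
1. Version data & code
        │
2. CI: automated tests & static analysis
   ├─ pytest
   └─ flake8
        │
3. Training & validation
   ├─ Data Prep
   ├─ Feature Store
   ├─ Training Job
   └─ Validation
        │
4. Deployment & release
   ├─ Docker Build
   ├─ Kubernetes Deploy
   └─ Canary Release
        │
5. Monitoring & feedback
   ├─ Prometheus/Grafana
   ├─ Data Drift
   └─ Retraining Trigger
```

### 8.5.1 Key Components

| Component | Responsibility |
|-----------|----------------|
| Code repository | Stores model, service, and test code |
| Data repository | Stores training, validation, and test sets |
| CI | Automated unit tests, linting, image builds |
| CD | Automated deployment, canary releases |
| Monitoring | Data drift, model drift, service metrics |
| Feedback | Retrain triggers, alerting |

> **Best practice**: Manage infrastructure with **Infra as Code** (Terraform / Pulumi) so that environments are reproducible.

## 8.6 Common Tools & Platforms

| Purpose | Products | Typical use |
|---------|----------|-------------|
| Version control | **Git** + **DVC** | Storing code, feature engineering, and models |
| Experiment tracking | **MLflow**, **Weights & Biases** | Comparing experiments |
| Training orchestration | **Airflow**, **Kubeflow Pipelines**, **Prefect** | Orchestrating data preparation, training, and evaluation |
| Model management & serving | **ModelDB**, **TensorFlow Serving**, **TorchServe** | Online inference |
| Containers & orchestration | **Docker**, **Kubernetes**, **Istio** | High availability, microservices |
| Monitoring & alerting | **Prometheus**, **Grafana**, **Alertmanager** | Metric collection |
| Data drift | **Alibi Detect**, **DataRobot Drift** | Automatically triggering retraining |

## 8.7 Security & Governance

- **Model confidentiality**: encrypt model files with GCP KMS or AWS KMS.
- **Audit logs**: record every request, training run, and deployment in MLflow Tracking to support GDPR / ISO 27001 audit requirements.
- **Access control**: restrict access at the API Gateway with RBAC + OAuth2.
- **Compliance auditing**: track data and model metadata with **Databricks Unity Catalog** or **Azure Purview** (now Microsoft Purview).

## 8.8 Summary

- **Model packaging**: cover serialization, dependency management, and version tracking.
- **API deployment**: FastAPI + Docker + cloud containers provide a highly available, scalable inference service.
- **Monitoring**: span service metrics, data drift, and model drift, using Prometheus/Grafana or MLflow.
- **Retraining**: automate data drift detection, orchestrate with an Airflow DAG, and roll out with a canary release.
- **MLOps pipeline**: combine CI/CD, Infra as Code, monitoring, and feedback into a repeatable, scalable machine learning lifecycle.

> **Practical reminder**: MLOps is not built in a day. Start with small-scale experiments and add CI, monitoring, and automation step by step; that is what makes large-scale production operation stable.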
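
Section 8.4 triggers retraining when PSI exceeds 0.1 but does not show how PSI is computed. The sketch below is one common formulation, not necessarily the one the author uses: it bins a single numeric feature by quantiles of the reference (training) sample and sums `(current - reference) * ln(current / reference)` over the bins. The function names `population_stability_index` and `should_retrain`, the 10-bin default, and the `eps` floor for empty bins are all assumptions made for illustration; only the 0.1 threshold comes from the chapter.

```python
import bisect
import math
import random
import statistics


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """Return the PSI of one numeric feature between two samples (hypothetical helper)."""
    # Interior cut points are quantiles of the reference sample, so each
    # reference bin holds roughly the same share of rows.
    cuts = sorted(set(statistics.quantiles(reference, n=bins)))

    def bin_shares(values):
        counts = [0] * (len(cuts) + 1)
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            counts[bisect.bisect_right(cuts, v)] += 1
        # Floor each share at eps so an empty bin does not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref = bin_shares(reference)
    cur = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def should_retrain(psi, threshold=0.1):
    """Apply the PSI > 0.1 rule from Section 8.4 (threshold is configurable)."""
    return psi > threshold


if __name__ == "__main__":
    rng = random.Random(0)
    train = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # reference sample
    shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # mean shifted by one sigma
    psi = population_stability_index(train, shifted)
    print(f"PSI={psi:.3f}, retrain={should_retrain(psi)}")
```

In a pipeline like the one in Section 8.4, a function of this kind would replace the hard-coded `psi = 0.12` in the drift-check task, computed per feature against a stored training sample. Two samples drawn from the same distribution should give a PSI close to zero, while a one-standard-deviation mean shift should land well above the 0.1 threshold.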
