Chapter 8: Deployment, Monitoring, and Iteration

Published on 2026-03-05 12:01

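# Chapter 8: Deployment, Monitoring, and Iteration

In the previous chapters we walked through the full workflow, from defining the business problem to evaluating the model. The focus now shifts to delivering the model to production and then monitoring, testing, and iterating on it in live operation, so that it stays efficient, reliable, and compliant.

## 1. Why Deployment Matters

- **Realizing value**: A model that never leaves the lab cannot improve revenue or operations.
- **Consistency**: A deployed service answers every consumer from the same model version, so identical inputs yield identical predictions and manual error is reduced.
- **Traceability**: The deployment process can integrate with experiment-management tools such as MLflow, which keeps model versions traceable.
- **Compliance and regulation**: In regulated fields such as finance and healthcare, the deployment process must meet regulatory requirements and be able to produce audit evidence.

## 2. Deployment Technology Overview

| Technology | Main use | Strengths | Weaknesses |
|------|----------|------|------|
| **MLflow** | End-to-end experiment management and model packaging | Easy version control, cross-platform deployment | Needs a separately deployed server |
| **Docker** | Containerized packaging | Consistent environments, easy to scale | Requires learning container concepts |
| **Kubernetes** | Orchestration and autoscaling | High availability, elastic scaling | Requires operations expertise |
| **AWS SageMaker / Azure ML / GCP Vertex AI** | Managed cloud services | One-click deployment, built-in monitoring | Higher cost, vendor lock-in |

This chapter focuses on the combination of **MLflow + Docker + Kubernetes**, a common and flexible deployment approach.

## 3. MLflow Deployment Workflow

MLflow packages a model in the **MLflow Model** format, which supports multiple flavors (Python function, ONNX, and others).

### 3.1 Tracking and Packaging

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

mlflow.set_experiment('iris_classification')

with mlflow.start_run():
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)
    # Log the fitted model as an artifact named 'model' under this run
    mlflow.sklearn.log_model(model, 'model')
    # This is accuracy on the training data, so it is an optimistic figure
    mlflow.log_metric('accuracy', model.score(X, y))
```

At this point MLflow has saved the model either locally (under `./mlruns` by default) or on a remote server such as an MLflow Tracking Server.

### 3.2 Exporting the Model as a Docker Image

```bash
# 1. Smoke-test the logged model by serving it locally
mlflow models serve -m runs:/<run_id>/model --env-manager local -p 5000
# 2. Reference the same model in the Dockerfile below
```

**Dockerfile**

```dockerfile
FROM python:3.10-slim

# Install MLflow and the model's dependencies
RUN pip install mlflow==2.5.0 scikit-learn

# Copy the local tracking store, which contains the logged model
COPY mlruns/ /opt/mlflow/mlruns/

# Expose the serving port
EXPOSE 5000

# Start the scoring server; <experiment_id> and <run_id> are placeholders for your own run
CMD ["mlflow", "models", "serve", "-m", "/opt/mlflow/mlruns/<experiment_id>/<run_id>/artifacts/model", "--env-manager", "local", "-h", "0.0.0.0", "-p", "5000"]
```

### 3.3 Deploying to Kubernetes

1. **Build the image**

   ```bash
   docker build -t myorg/iris-model:latest .
   docker push myorg/iris-model:latest
   ```

2. **Write the Kubernetes Deployment manifest**

   **iris-deployment.yaml**

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: iris-model
   spec:
     replicas: 3
     selector:
       matchLabels:
         app: iris-model
     template:
       metadata:
         labels:
           app: iris-model
       spec:
         containers:
           - name: iris-model
             image: myorg/iris-model:latest
             ports:
               - containerPort: 5000
   ```

3. **Deploy**

   ```bash
   kubectl apply -f iris-deployment.yaml
   ```

4. **Expose the service**

   **iris-service.yaml**

   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: iris-service
   spec:
     selector:
       app: iris-model
     ports:
       - protocol: TCP
         port: 80
         targetPort: 5000
     type: LoadBalancer
   ```

   ```bash
   kubectl apply -f iris-service.yaml
   ```

> **Tip**: Use a **Horizontal Pod Autoscaler** to watch CPU/memory utilization and scale out automatically.

## 4. Monitoring and Drift Detection

### 4.1 Monitoring Metrics

| Metric | Purpose |
|------|------|
| **Latency** | Reflects inference speed and confirms the SLA is met |
| **Throughput** | Tracks request volume to forecast resource needs |
| **Error Rate** | Tracks the rate of exceptions or failed requests |
| **Model Accuracy Drift** | Detects a decline in prediction quality |

> **Tools**: Prometheus + Grafana (collection and visualization)

### 4.2 Drift Detection Workflow

1. **Data drift**: The feature distribution moves away from the training distribution.
2. **Concept drift**: The relationship between labels and features changes.
3. **Detection methods**:
   - **KS test**: Compare the feature distributions of the training set and of production data.
   - **Population Stability Index (PSI)**: Quantify how much a distribution has shifted.
   - **Concept drift monitoring**: Compute a rolling average of **F1** or **Accuracy** from predictions and the actual labels.

**Example: computing PSI with NumPy**

```python
import numpy as np

def calculate_psi(expected, actual, buckets=10, eps=1e-6):
    # Bin edges span the combined range of both samples
    low = min(np.min(expected), np.min(actual))
    high = max(np.max(expected), np.max(actual))
    breakpoints = np.linspace(low, high, buckets + 1)
    expected_percents = np.histogram(expected, breakpoints)[0] / len(expected)
    actual_percents = np.histogram(actual, breakpoints)[0] / len(actual)
    # Clip empty buckets so the division and the log stay finite
    expected_percents = np.clip(expected_percents, eps, None)
    actual_percents = np.clip(actual_percents, eps, None)
    psi = np.sum((expected_percents - actual_percents) * np.log(expected_percents / actual_percents))
    return psi
```

> **Warning**: PSI > 0.25 is commonly treated as significant drift and warrants further investigation.

## 5. A/B Testing and Blue-Green Deployment

### 5.1 A/B Testing Basics

- **Goal**: Compare the business impact of two or more model versions (for example CTR or conversion rate).
- **Design**: Randomly split traffic across the versions and run the test for a pre-planned sample size or duration, so that the result can reach statistical significance.
- **Metrics**: CTR, conversion rate, revenue, cost, and similar.

### 5.2 Blue-Green Deployment on Kubernetes

| Step | Description |
|------|------|
| **Create two Deployments** | `model-blue` and `model-green` |
| **Create two Services** | Each Service points to a different Deployment |
| **Switch traffic through an Ingress or LoadBalancer** | Adjust the traffic ratio through `weight` or `canary` settings |
| **Monitor** | If the new version does not perform as expected, roll back quickly to the old version |

> **Tools**: Istio and Linkerd can implement traffic splitting; Kustomize helps manage the blue and green manifest variants.

## 6. CI/CD Pipeline (Continuous Delivery)

| Component | Role |
|------|------|
| **GitHub / GitLab** | Version control and code hosting |
| **GitHub Actions / GitLab CI** | Automated build, test, and deployment |
| **MLflow Tracking Server** | Experiment logs and model version management |
| **Docker Registry** | Stores container images |
| **Kubernetes** | Runs the actual service |

### 6.1 Pipeline Example

```yaml
# .github/workflows/deploy.yml
name: CI/CD Pipeline
on:
  push:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest
      # Assumes a registry login step (for example docker/login-action) runs before this one
      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          # A unique tag per commit makes the rollout below actually change the image
          tags: myorg/iris-model:${{ github.sha }}
      - name: Set up kubectl
        uses: azure/setup-kubectl@v1
        with:
          version: 'v1.28.0'
      - name: Deploy to K8s
        env:
          # Assumes the KUBECONFIG secret holds the contents of a kubeconfig file
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG }}
        run: |
          echo "$KUBECONFIG_DATA" > "$RUNNER_TEMP/kubeconfig"
          export KUBECONFIG="$RUNNER_TEMP/kubeconfig"
          kubectl set image deployment/iris-model iris-model=myorg/iris-model:${{ github.sha }}
```

> **Note**: During the test stage you can call **MLflow**'s `mlflow.register_model` to register the new model in the model registry, then run a canary validation before deployment.

## 7. Continuous Iteration and Improvement

1. **Monitoring feedback**: Send drift metrics back to the experiment platform to trigger an automated retraining workflow.
2. **Automated retraining**: Use **Kubeflow Pipelines** or **Airflow** to trigger retraining jobs on a schedule.
3. **Version comparison**: Archive every training run under its `run_id` so that it can be traced back.
4. **Multi-model comparison**: When an A/B test ends, promote the winning version to production and keep the losing versions in the test environment.

## 8. Case Study: Online Retail Recommendation System

| Step | Implementation details |
|------|-----------|
| 1. Model training | LightGBM predicts user click-through rate; version v1.0 is stored in MLflow |
| 2. Dockerization | The `Dockerfile` installs `mlflow`, `lightgbm`, and `pandas` |
| 3. K8s deployment | Two ReplicaSets, Blue and Green |
| 4. A/B test | 50/50 traffic split, run for 48 hours |
| 5. Drift detection | PSI is computed every 30 minutes; PSI > 0.2 triggers retraining |
| 6. Results | CTR up 3.5%, revenue up 2.8% |

## 9. Summary

1. **Deployment**: MLflow + Docker + Kubernetes is a common and scalable combination that delivers quickly in cloud or on-premises environments.
2. **Monitoring**: Combining Prometheus + Grafana with model drift detection surfaces quality degradation early.
3. **A/B testing**: Blue-green or canary deployment lowers risk and verifies that a new model brings real benefit.
4. **CI/CD**: An automated pipeline shortens the iteration cycle from development to production and keeps models fresh.
5. **Ethics and compliance**: Keep monitoring model fairness and privacy impact to stay compliant with regulations such as GDPR and PDPA.

> **Future directions**: With the rise of AutoML, Edge AI, and serverless computing, deployment will become simpler still, and cloud-native "Model-as-a-Service" patterns may emerge.

---

> **References**:
> - MLflow official documentation: https://mlflow.org/docs/latest/
> - Kubernetes official documentation: https://kubernetes.io/docs/
> - Prometheus official documentation: https://prometheus.io/docs/
> - Scikit‑Learn Drift Toolkit: https://github.com/scikit-mdrift/drift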
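
Section 4.2 suggests tracking a rolling average of accuracy to catch concept drift. The idea can be sketched with the standard library alone. The class name `RollingAccuracy`, the `window` and `threshold` parameters, and their default values are assumptions made for illustration and do not come from any library mentioned in this chapter. The sketch also assumes that the true label for each prediction eventually becomes available.

```python
from collections import deque


class RollingAccuracy:
    """Track accuracy over the most recent `window` labelled predictions.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, window=500, threshold=0.9):
        self.window = window
        self.threshold = threshold
        # 1 for a correct prediction, 0 otherwise; old entries fall off automatically
        self._hits = deque(maxlen=window)

    def update(self, predicted, actual):
        """Record one prediction once its true label is known."""
        self._hits.append(1 if predicted == actual else 0)

    @property
    def accuracy(self):
        """Rolling accuracy, or None before any label has arrived."""
        if not self._hits:
            return None
        return sum(self._hits) / len(self._hits)

    def drift_suspected(self):
        """True only when the window is full and accuracy is below the threshold."""
        return len(self._hits) == self.window and self.accuracy < self.threshold


# Usage: a tiny window so the behaviour is easy to follow
monitor = RollingAccuracy(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.update(predicted, actual)
print(monitor.accuracy)           # 0.5
print(monitor.drift_suspected())  # True
```

Waiting for a full window before raising a flag avoids false alarms from the first few labels; in practice the flag would feed an alert or a retraining trigger rather than a `print`.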
