# Chapter 7: Model Evaluation and Explainability

Published 2026-03-05 11:48

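This chapter focuses on two core topics: **model performance evaluation** and **explainability**. Appropriate metrics quantify how well a model performs; explanation methods turn a black-box model into a decision tool that people can understand. Together they make a model not only accurate but also persuasive to business colleagues and compliant with regulatory requirements, so that data science keeps delivering value.

---

## 7.1 Overview of Performance Metrics

### 7.1.1 Regression Metrics

In the formulas below, \(y_i\) is the observed value, \(\hat{y}_i\) the prediction, \(\bar{y}\) the mean of the observed values, \(n\) the number of samples, and \(p\) the number of predictors.

| Metric | Definition | Formula | When to use |
|------|------|------|----------|
| **MAE** | Mean absolute error | \[ \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert \] | Same units as the target; less sensitive to outliers than MSE |
| **MSE** | Mean squared error | \[ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \] | Outliers weigh heavily; use it when large errors must be penalized |
| **RMSE** | Root mean squared error | \[ \text{RMSE} = \sqrt{\text{MSE}} \] | Same units as the target, so easier to interpret than MSE |
| **R²** | Coefficient of determination | \[ R^2 = 1 - \frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2} \] | At most 1, and closer to 1 means more variance explained; it can be negative on held-out data when the model does worse than predicting the mean |
| **Adjusted R²** | Adjusted coefficient of determination | \[ \bar{R}^2 = 1 - \left(1-R^2\right)\frac{n-1}{n-p-1} \] | Models with several predictors; penalizes adding predictors that do not improve the fit |

### 7.1.2 Classification Metrics

TP, TN, FP, and FN denote the counts of true positives, true negatives, false positives, and false negatives.

| Metric | Definition | Formula | When to use |
|------|------|------|----------|
| **Accuracy** | Share of all predictions that are correct | \[ \frac{TP+TN}{TP+TN+FP+FN} \] | Informative only when the classes are roughly balanced |
| **Precision** | Share of predicted positives that are truly positive | \[ \frac{TP}{TP+FP} \] | When false positives are costly |
| **Recall** | Share of actual positives that are detected | \[ \frac{TP}{TP+FN} \] | When missed positives (false negatives) are costly |
| **F1-score** | Harmonic mean of precision and recall | \[ 2\times\frac{\text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}} \] | When both matter and a single number is needed |
| **AUC-ROC** | Area under the receiver operating characteristic curve | Ranges from 0 to 1; 0.5 corresponds to random ranking | Measures how well the scores separate the two classes |
| **AUC-PR** | Area under the precision-recall curve | Ranges from 0 to 1 | Better suited to heavily imbalanced data |
| **Log Loss** | Cross-entropy loss | \[ -\frac{1}{n}\sum_{i=1}^{n}[y_i\log(p_i)+(1-y_i)\log(1-p_i)] \] | When the model outputs probabilities |

### 7.1.3 Cross-Validation

> **k-fold CV**: split the data into k folds; each fold serves once as the validation set while the remaining folds form the training set. k=5 and k=10 are the most common choices. The mean and standard deviation of the metric across folds give a more reliable estimate of generalization than a single split.

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score

# rf, X and y are assumed to exist already (see the example in section 7.4).
# Stratified folds keep the class ratio the same in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scorer = make_scorer(f1_score)

scores = cross_val_score(rf, X, y, cv=skf, scoring=f1_scorer)
print('F1 mean:', scores.mean(), 'std:', scores.std())
```

---

## 7.2 Confusion Matrix

A confusion matrix shows at a glance how the model performs on each class. For multi-class problems it extends to n×n. A binary example:

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |

> **Reminder**: on imbalanced datasets (for example, rare-event detection), accuracy can be misleading; rely on recall, F1, AUC, and similar metrics instead.

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# y_test and y_pred are assumed to be the true and predicted labels.
# scikit-learn sorts labels in ascending order, so with 0/1 labels the
# array is [[TN, FP], [FN, TP]]: the table above with rows and columns reversed.
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title('Confusion Matrix')
plt.show()
```

---

## 7.3 Explainability Methods

As models grow more complex, **explainability** becomes more important. The common techniques and where they apply are listed below.

### 7.3.1 Global vs. Local Explanations

| Type | Purpose | Typical methods |
|------|------|----------|
| **Global** | Explain overall model behavior | Feature importance, partial dependence plots (PDP), ICE curves |
| **Local** | Explain a single prediction | SHAP, LIME, Anchors |

### 7.3.2 SHAP (SHapley Additive exPlanations)

- Uses Shapley values from game theory to compute each feature's contribution to a prediction.
- **Advantages**: consistency, additivity, and flexible visualization.
- **Common visuals**:
  - **Summary Plot**: distribution of feature importance across all samples.
  - **Dependence Plot**: relationship between a feature's value and its SHAP value.
  - **Force Plot**: breakdown of the contributions to a single prediction.

```python
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# For a classifier, pick the positive class (1): older shap versions return a
# list with one array per class, newer ones a (samples, features, classes) array.
pos_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Summary plot
shap.summary_plot(pos_shap, X_test)

# Force plot for the first instance (call shap.initjs() first in a notebook)
shap.force_plot(explainer.expected_value[1], pos_shap[0], X_test.iloc[0])
```

### 7.3.3 LIME (Local Interpretable Model-agnostic Explanations)

- Approximates the black-box model's behavior near one prediction with a local linear model.
- **Usage**: quickly produces an intuitive answer to "why was this customer predicted as non-churn?"

```python
import numpy as np
from lime import lime_tabular

explainer = lime_tabular.LimeTabularExplainer(
    training_data=np.array(X_train),
    feature_names=list(X_train.columns),
    class_names=['Non-Churn', 'Churn'],
    discretize_continuous=True
)

# explain_instance expects a 1-D NumPy array rather than a pandas Series
exp = explainer.explain_instance(
    X_test.iloc[0].values, rf.predict_proba, num_features=5
)
exp.show_in_notebook(show_table=True)
```

### 7.3.4 Partial Dependence Plot (PDP) & ICE

- PDP: shows how the model's average prediction changes as a single feature varies.
- ICE: draws one dependence curve per sample, which reveals differences between individuals.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# plot_partial_dependence was removed in scikit-learn 1.2;
# PartialDependenceDisplay.from_estimator is its replacement.
features = ['age', 'income']  # example column names; use columns present in X_test
PartialDependenceDisplay.from_estimator(rf, X_test, features, kind='both')
plt.show()
```

---

## 7.4 Example: Credit Card Fraud Detection

The following walks through the full flow, from loading the data and evaluating the model to visualizing SHAP values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix
import shap

# 1. Load the data
credit = pd.read_csv('creditcard.csv')
X = credit.drop('Class', axis=1)
y = credit['Class']

# 2. Split the data (stratify to keep the fraud ratio equal in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 3. Train the model
rf = RandomForestClassifier(n_estimators=200, max_depth=12, random_state=42)
rf.fit(X_train, y_train)

# 4. Evaluate
pred = rf.predict(X_test)
proba = rf.predict_proba(X_test)[:, 1]
print('AUC:', roc_auc_score(y_test, proba))
print('F1:', f1_score(y_test, pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, pred))

# 5. SHAP analysis (this can be slow on the full test set; a sample also works)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
# Select the fraud class (1): older shap versions return a list with one
# array per class, newer ones a (samples, features, classes) array.
fraud_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(fraud_shap, X_test)
```

> **Observation**: in credit card fraud problems, features such as **V14** and **V12** often show large SHAP values, which indicates a strong influence on the model's decisions. A Force Plot can then explain why a single transaction was flagged as fraud, which helps with audits and compliance.

---

## 7.5 Explainability Best Practices

| Recommendation | Why it matters |
|------|--------------|
| **Define the explanation goal** | Whether global or local explanations are needed depends on the decision process and compliance requirements |
| **Work with domain experts** | Experts can translate numeric explanations into business language |
| **Keep visuals simple** | Clear charts let decision makers understand the result quickly |
| **Version explanations** | Explanations stay traceable as the model iterates |
| **Compliance review** | Confirm that explanations do not expose sensitive information and that they comply with GDPR and the Personal Data Protection Act (個資法) |
| **Automated testing** | Adding explainability checks to CI/CD helps guard against model bias |

---

## 7.6 Tool Summary

| Package | Purpose | Sample code |
|------|------|------------|
| `scikit-learn` | metrics, cross_val_score, PDP | `from sklearn.metrics import f1_score` |
| `shap` | Global and local explanations | `explainer = shap.TreeExplainer(rf)` |
| `lime` | Local explanations | `explainer = lime_tabular.LimeTabularExplainer(...)` |
| `yellowbrick` | Visual evaluation | `from yellowbrick.classifier import ROCAUC` |
| `mlflow` | Experiment tracking, metric logging | `mlflow.log_metric('f1', f1)` |

---

## Summary

1. **Performance evaluation**: choose metrics that fit the task (regression vs. classification, balanced vs. imbalanced) and estimate generalization with cross-validation.
2. **Confusion matrix**: shows the types of error at a glance, which matters most when interpreting metrics on imbalanced data.
3. **Explainability**: SHAP and LIME are the two mainstream methods. Both explain individual predictions, and SHAP values can also be aggregated into a global view. PDP, ICE, and feature importance complement them.
4. **Practical integration**: store evaluation metrics together with explanation outputs in CI/CD and MLflow so that experiments remain traceable.
5. **Ethics and compliance**: explainability improves model transparency and is also a necessary step toward legal compliance.

With the metrics and tools in this chapter, readers can validate model performance in real projects and also explain clearly to decision makers and compliance staff why the model behaves as it does, which raises both the business value of data science and public trust in it.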
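As a sanity check on the formulas in sections 7.1.1 and 7.1.2, the sketch below computes the metrics directly from their definitions using only the standard library. The function names `regression_metrics` and `classification_metrics`, the toy values, and the confusion-matrix counts are illustrative assumptions, not part of the chapter's dataset; in practice the `sklearn.metrics` functions would normally be used instead. The sketch also assumes every denominator is non-zero.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute the section 7.1.1 metrics straight from their formulas."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    y_mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R² penalizes the number of predictors p = n_features
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


def classification_metrics(tp, fp, fn, tn):
    """Compute the section 7.1.2 count-based metrics from confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy regression data (hypothetical values, one predictor)
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0], n_features=1)
print({k: round(v, 4) for k, v in reg.items()})

# Hypothetical imbalanced case: 1000 samples, only 20 actual positives
clf = classification_metrics(tp=8, fp=2, fn=12, tn=978)
print({k: round(v, 4) for k, v in clf.items()})
```

In the second call, accuracy is 0.986 even though recall is only 0.4, which is the distortion on imbalanced data that the reminder in section 7.2 warns about.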
