Chapter 4: Introduction to Machine Learning, from Supervised to Unsupervised Learning

Published 2026-02-26 09:45

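This chapter takes you from the foundations of statistical inference to the practice of machine learning. The previous chapter introduced cross-validation and model evaluation; machine learning additionally calls for a systematic understanding of model families, feature engineering, and evaluation metrics. The sections below are organized around supervised and unsupervised learning, and combine working Python examples with reproducible experiment design so you can put the ideas into practice quickly.

---

## 4.1 Supervised Learning Concepts

In supervised learning, a model learns the mapping "input X → output Y" from the data, with Y serving as the training "label". Common problem types are:

- **Regression**: predict a continuous value.
- **Classification**: predict a discrete label.

> **Note**: Even when the task is ultimately classification, the model often predicts a continuous probability first and then makes a binary or multiclass decision.

---

## 4.2 Typical Models and Implementation

Using scikit-learn, the examples below demonstrate two basic models: linear regression and logistic regression. Every notebook cell ends with a `# ----` marker so that it can be identified and rerun on its own; the fixed `random_state` values are what keep the results themselves reproducible.

```python
# --- 4.2.1 Linear regression demo
# load_boston was removed in scikit-learn 1.2, so this example uses the
# California housing data (downloaded and cached on the first call).
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
# ----
```

```python
# --- 4.2.2 Logistic regression demo
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
# ----
```

> **Tip**: `load_boston` was removed in scikit-learn 1.2 because of ethical concerns about the dataset, so the regression example uses `fetch_california_housing`, which downloads the data on first use. If you are working offline, a bundled dataset such as `load_diabetes`, or a file you have downloaded yourself, works just as well.

---

## 4.3 Evaluation Metrics

| Metric | Use case | Formula / description |
|--------|----------|-----------------------|
| MSE | Regression | $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ |
| MAE | Regression | $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert$ |
| R² | Regression | $1 - \mathrm{SSE}/\mathrm{SST}$ |
| Accuracy | Classification | number of correct predictions / $n$ |
| F1-Score | Classification | $2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$ |

> **Remark**: When the classes are imbalanced, accuracy can be misleading. In that case, consider recall, precision, or ROC-AUC.

---

## 4.4 Cross-Validation Essentials

In machine learning, cross-validation is used not only to assess how robust a model is but also to tune its hyperparameters. Taking `GridSearchCV` as an example: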
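```python
# --- 4.4.1 GridSearchCV demo
# Reuses X_train / y_train from the iris split in 4.2.2.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# gamma is ignored by the linear kernel, so it is only searched for rbf.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [1, 0.1, 0.01]},
]

svc = SVC()
grid = GridSearchCV(svc, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)
# ----
```

> **Tip**: If the dataset is very large, consider `RandomizedSearchCV` to reduce the computational cost.

---

## 4.5 Case Study: Binary Classification

Using the `tips` dataset from `seaborn`, we build a model that separates "high tip" (tip >= 5) from "low tip" observations. The code is split into independent cells, each ending with `# ----` so it can be rerun later.

```python
# 4.5.1 Load the data and preprocess
import seaborn as sns
import pandas as pd

tips = sns.load_dataset("tips")
# Binary target: 1 if the tip is at least 5, otherwise 0
tips["high_tip"] = (tips["tip"] >= 5).astype(int)
X = tips[["total_bill", "size"]]
y = tips["high_tip"]
# ----
```

```python
# 4.5.2 Split the data and fit the model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline, so it is fitted on the training split only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

print("Test accuracy:", pipe.score(X_test, y_test))
# ----
```

---

## 4.6 Overview of Unsupervised Learning

Unsupervised learning works with **unlabeled** data, and its goal is to discover the structure inherent in the data. Common tasks are:

- **Clustering**: group similar observations together.
- **Dimensionality reduction**: represent the data with fewer variables.
- **Anomaly detection**: find points that deviate from the norm.

The next section uses K-Means and PCA as examples.

---

## 4.7 Case Study: Clustering

Using the `iris` dataset from `seaborn`, we first reduce the four-dimensional data to two dimensions with PCA, then run K-Means clustering, and finally visualize the result with a scatter plot.

```python
# 4.7.1 Load the data and apply PCA
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")
X = iris.drop("species", axis=1)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# ----
```

```python
# 4.7.2 K-Means clustering
from sklearn.cluster import KMeans

# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_pca)
iris["cluster"] = clusters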
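plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap="viridis", s=50)
plt.title("K-Means clustering result (PCA 2D)")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.colorbar(label="Cluster ID")
plt.show()
# ----
```

> **Takeaway**: Because K-Means runs on the PCA projection here, the clusters it finds depend heavily on the projection directions. If you need a finer-grained grouping, consider **hierarchical clustering** or **DBSCAN**.

---

## 4.8 Feature Engineering and Automation

1. **Feature expansion**: interaction terms, log transforms, splitting timestamps into components.
2. **Missing-value handling**: mean/median imputation, interpolation, or dropping the rows outright.
3. **Pipelines**: chain data processing, feature engineering, and model training together to avoid data leakage.

```python
# 4.8.1 Pipeline demo
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The column names below are illustrative; replace them with your own.
numeric_features = ["age", "income"]
numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler()
)

categorical_features = ["education", "occupation"]
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
# ----
```

---

## 4.9 Reproducible Experiment Design

- **Fix random seeds**: set `random_state` on every `train_test_split`, `KMeans`, `SVC`, and similar call.
- **Use pipelines**: keep preprocessing and model training inside the same `Pipeline` or `FeatureUnion` to avoid data leakage.
- **Write results to a file**: save the hyperparameters and performance of every experiment to a CSV file so runs can be tracked.

```python
# 4.9.1 Save experiment results
# Uses pipe, X_test, and y_test from 4.5.2.
from pathlib import Path

import pandas as pd

log_path = Path("experiment_log.csv")
results = {
    "timestamp": pd.Timestamp.now(),
    "model": "LogisticRegression",
    "accuracy": pipe.score(X_test, y_test),
}
# Append one row per run; write the header only when the file does not exist yet.
pd.DataFrame([results]).to_csv(
    log_path, mode="a", header=not log_path.exists(), index=False
)
# ----
```

---

## 4.10 Summary

- Supervised and unsupervised learning are the two foundations of machine learning; you need to understand their problem types, model choices, and evaluation metrics to build and deploy models correctly.
- Cross-validation is not only an evaluation tool but also an important means of hyperparameter tuning.
- Reproducible experiment design (fixed random seeds, pipelines, and `# ----` cell markers) keeps experiment results comparable and maintainable.

The following chapters go further into deep learning, reinforcement learning, and model deployment, so you can apply these methods quickly in different domains (finance, healthcare, e-commerce). Happy learning!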
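As a closing sketch, the three summary points can be combined in one small script: a pipeline evaluated with seeded cross-validation that reports F1 and ROC-AUC next to accuracy. The dataset (`load_breast_cancer`), the helper name `evaluate_pipeline`, and the `SEED` constant are assumptions chosen for illustration and do not come from the chapter's own examples.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # hypothetical constant: one fixed seed so the folds never change


def evaluate_pipeline(seed: int = SEED) -> dict:
    """Cross-validate a scaler + logistic regression pipeline; return mean scores."""
    # Bundled binary-classification dataset (assumption, no download needed)
    X, y = load_breast_cancer(return_X_y=True)

    # The scaler is refitted inside every training fold, which avoids leakage
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    # Shuffled, stratified folds with a fixed seed for repeatable splits
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    metrics = ("accuracy", "f1", "roc_auc")
    scores = cross_validate(pipe, X, y, cv=cv, scoring=list(metrics))
    return {name: float(scores[f"test_{name}"].mean()) for name in metrics}


summary = evaluate_pipeline()
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Calling `evaluate_pipeline()` a second time with the same seed should give the same numbers, which is a quick way to check that an experiment is repeatable before logging it.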
