3. Data Preprocessing: Cleaning, Transformation, and Feature Engineering

Published 2026-02-21 01:22

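Data preprocessing is among the most challenging and most influential stages of the data science workflow. Clean, well-structured, information-rich data can markedly improve model performance, reduce bias and variance, and help keep a model maintainable once it is deployed. This chapter covers missing-value handling, data transformation, outlier detection, and feature engineering, through to a complete workflow and best practices, with theoretical grounding, practical examples, and implementation guidance.

---

## 3.1 Handling Missing Values

| Strategy | When to use | Main advantage | Potential risk |
|----------|-------------|----------------|----------------|
| Drop rows/columns | Low missing rate, ample data | Simple and fast | May lose important information |
| Estimated imputation | Moderate missing rate | Keeps all samples | May introduce bias |
| Prior-based / model-predicted imputation | Heavy missingness, with business priors available | Avoids information loss | Requires prior knowledge |

### 3.1.1 Imputation Methods

- **Statistical imputation**: mean, median, mode
- **kNN imputation**: estimate from similar neighbors
- **Regression imputation**: predict the missing value from other features
- **Multiple imputation**: accounts for the uncertainty of the imputed values

### 3.1.2 Python Example

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Sample data (np.nan marks missing values so the imputers detect them)
df = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 38],
    'income': [50000, np.nan, 62000, 72000, 58000],
    'gender': ['M', 'F', 'F', np.nan, 'M']
})

# 1. Mean imputation (numeric)
num_imputer = SimpleImputer(strategy='mean')
df['age'] = num_imputer.fit_transform(df[['age']]).ravel()

# 2. kNN imputation (numeric columns only; KNNImputer does not handle
#    categorical data). Here it fills the remaining gap in income.
knn_imputer = KNNImputer(n_neighbors=3)
df[['age', 'income']] = knn_imputer.fit_transform(df[['age', 'income']])

# 3. Mode imputation (categorical)
cat_imputer = SimpleImputer(strategy='most_frequent')
df['gender'] = cat_imputer.fit_transform(df[['gender']]).ravel()

print(df)
```

> **Practical tip**: Inspect the missingness pattern first (missing completely at random, missing at random, or missing not at random). For data missing not at random, prefer model-based prediction or multiple imputation where possible, and state the potential bias in your report.

---

## 3.2 Data Transformation and Scaling

| Transformation | Purpose | Typical function | Typical use |
|----------------|---------|------------------|-------------|
| **Standardization (StandardScaler)** | Zero mean, unit variance | `sklearn.preprocessing.StandardScaler` | Distance- or scale-sensitive models such as regression, SVM, KNN |
| **Normalization (MinMaxScaler)** | Maps values to [0, 1] | `sklearn.preprocessing.MinMaxScaler` | Images, deep learning preprocessing |
| **Log transform** | Handles skewed distributions | `np.log1p` | Income, sales revenue |
| **Box-Cox / Yeo-Johnson** | Makes data closer to normal | `scipy.stats.boxcox`, `scipy.stats.yeojohnson` | Symmetrizing distributions, reducing heteroscedasticity |
| **Binning** | Discretizes continuous features to reduce complexity | `pandas.cut` | Age brackets |

### 3.2.1 Python Example

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1, 2], [2, 0], [0, 0], [4, 6]])

# Standardization: each column gets zero mean and unit variance
scaler_std = StandardScaler()
X_std = scaler_std.fit_transform(X)
print('Standardized:\n', X_std)

# Normalization: each column is rescaled to [0, 1]
scaler_mm = MinMaxScaler()
X_mm = scaler_mm.fit_transform(X)
print('MinMax:\n', X_mm)
```

---

## 3.3 Outlier Detection

| Method | When to use | Advantage | Drawback |
|--------|-------------|-----------|----------|
| **Z-Score** | Normally distributed data | Intuitive | The mean and standard deviation are themselves distorted by extreme values |
| **IQR (interquartile range)** | Non-normal data | Insensitive to extreme values | Requires choosing the quantile cutoffs |
| **Isolation Forest** | High-dimensional, nonlinear data | Scales to large datasets | Requires parameter tuning |
| **DBSCAN** | Spatial (density-based) clustering | Detects low-density points automatically | Parameters are hard to tune |

### 3.3.1 Python Example

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Sample data
np.random.seed(42)
data = pd.DataFrame({'feature': np.random.normal(0, 1, 1000)})

# Add outliers (ignore_index avoids duplicate index labels)
data = pd.concat(
    [data, pd.DataFrame({'feature': [10, -10, 12]})],
    ignore_index=True
)

# Isolation Forest: fit_predict returns -1 for outliers and 1 for inliers
iso = IsolationForest(contamination=0.01, random_state=42)
data['anomaly'] = iso.fit_predict(data[['feature']])

print('Outlier ratio:', (data['anomaly'] == -1).mean())
```

---

## 3.4 Feature Engineering

### 3.4.1 Feature Selection

| Type | Description | Typical methods |
|------|-------------|-----------------|
| **Filter** | Scores each feature's relevance before modeling | Correlation coefficient, chi-squared test, mutual information |
| **Wrapper** | Evaluates feature subsets by model performance | Forward selection, backward elimination, genetic algorithms |
| **Embedded** | Selects features as part of model training | Lasso, tree-based feature importance |

### 3.4.2 Feature Construction

| Technique | Typical examples |
|-----------|------------------|
| **Interaction features** | `age * income`, `num_items * avg_price` |
| **Polynomial features** | `np.power(feature, 2)`, `PolynomialFeatures` |
| **Time features** | Day of week, season, weekday vs. weekend |
| **Bucketed features** | `np.floor(log_feature / 10)` |
| **Aggregate features** | Total transaction count, average transaction amount |

### 3.4.3 Text and Categorical Features

| Transformation | Function |
|----------------|----------|
| **Tokenization / TF-IDF** | `sklearn.feature_extraction.text.TfidfVectorizer` |
| **Word Embedding** | Word2Vec, GloVe |
| **Label Encoding / One-Hot** | `pandas.get_dummies`, `sklearn.preprocessing.OneHotEncoder` |

### 3.4.4 Python Example

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Original features
X = pd.DataFrame({'age': [25, 35, 45], 'income': [50000, 65000, 80000]})

# Interaction feature
X['age_income'] = X['age'] * X['income']

# Polynomial features (degree=2): age, income, age^2, age*income, income^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X[['age', 'income']])

print('Polynomial Features:\n', X_poly)
```

---

## 3.5 Building a Pipeline

A **scikit-learn Pipeline** packages the preprocessing steps into a repeatable, traceable unit. Because every transformer is fitted only on the data passed to `fit`, this avoids data leakage as well as duplicated code.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Numeric and categorical feature lists
numeric_features = ['age', 'income']
categorical_features = ['gender', 'region']

# Numeric preprocessing pipeline
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical preprocessing pipeline
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the per-column transformers
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
model = Pipeline([
    ('preprocess', preprocessor),
    ('regressor', LinearRegression())
])

# Assumes df is the input DataFrame with the columns above plus a 'target' column
model.fit(df[numeric_features + categorical_features], df['target'])
print('Model coefficients:', model.named_steps['regressor'].coef_)
```

---

## 3.6 Data Preprocessing Best Practices

| Item | Recommended practice | Tools / notes |
|------|----------------------|---------------|
| **Version control** | Manage preprocessing scripts and data versions with `DVC`, `MLflow`, or `Delta Lake` | Versioning, metadata storage |
| **Automation** | Add data preprocessing tests to the CI/CD workflow | `pytest`, `ruff`, `pre-commit` |
| **Logging and tracking** | Generate data summaries with `pandas-profiling` (since renamed `ydata-profiling`) or `sweetviz` | Easier backtracking and team collaboration |
| **Documentation** | Write a `README.md` or `data_processing.ipynb` that states the assumptions and parameters of each step | Helps newcomers get up to speed |
| **Performance monitoring** | Monitor preprocessing time and memory consumption | Tune batch processing for large datasets |

---

## 3.7 Case Study: Retail Sales Forecasting

### 3.7.1 Data Overview

| Column | Type | Missing rate | Description |
|--------|------|--------------|-------------|
| `store_id` | Categorical | 0% | Store identifier |
| `date` | Date | 0% | Transaction date |
| `items_sold` | Numeric | 5% | Quantity sold |
| `avg_price` | Numeric | 2% | Average unit price |
| `promo_flag` | Categorical | 0% | Promotion indicator |

### 3.7.2 Preprocessing Flow

1. **Date feature expansion**: `day_of_week`, `month`, `is_weekend`
2. **Missing-value imputation**: median for `items_sold`; kNN for `avg_price`
3. **Log transform**: `items_sold`, `avg_price`
4. **Standardization**: `items_sold`, `avg_price`
5. **Outlier detection**: remove extreme values with the IQR rule
6. **Interaction feature**: `items_sold * avg_price`
7. **Feature selection**: keep the 5 most important features by tree-based importance
8. **Pipeline**: wrap the whole flow with `ColumnTransformer` and `Pipeline`

### 3.7.3 Pipeline Code Snippet

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

numeric_features = ['items_sold', 'avg_price']
categorical_features = ['promo_flag', 'day_of_week', 'month']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

model = Pipeline([
    ('preprocess', preprocessor),
    ('rf', RandomForestRegressor(n_estimators=200, random_state=42))
])

# Assumes df already contains the date features from step 1
# and a 'target_sales' column
X = df[numeric_features + categorical_features]
y = df['target_sales']
model.fit(X, y)

print('Training complete; feature importances:',
      model.named_steps['rf'].feature_importances_)
```

---

## 3.8 Summary

- **Missing values**: Inspect the missingness pattern first, then choose a suitable imputation method; if the data are missing not at random, state the resulting bias in your report.
- **Data transformation**: Whether to standardize or normalize depends on how sensitive the model is to distance and scale; skewed distributions can be handled with a log or Box-Cox transform.
- **Outliers**: Z-Score is affected by extreme values, while IQR is more robust; for high-dimensional data, consider Isolation Forest or DBSCAN.
- **Feature engineering**: Feature selection and feature construction complement each other; use a Pipeline to avoid data leakage.
- **Best practices**: Version control, metadata tracking, unit tests, and documentation are key to reproducibility and sustainable deployment.

With these methods and workflows, you can carry out data preprocessing efficiently across different business scenarios and lay a solid foundation for model training and deployment.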
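The outlier point in the summary contrasts Z-Score with IQR, and the two rules can be compared directly. The following is a minimal sketch, assuming synthetic data (1,000 standard-normal points plus three planted outliers). The helper names `zscore_outliers` and `iqr_outliers`, the threshold of 3, and the multiplier of 1.5 are illustrative choices, not values prescribed by this chapter.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration):
# 1,000 standard-normal points plus 3 planted outliers at the end
rng = np.random.default_rng(42)
values = pd.Series(
    np.concatenate([rng.normal(0, 1, 1000), [10, -10, 12]]),
    name='feature'
)

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

z_mask = zscore_outliers(values)
iqr_mask = iqr_outliers(values)
print('Z-score flags:', int(z_mask.sum()))
print('IQR flags:', int(iqr_mask.sum()))

# Keep only the rows that the IQR rule does not flag
cleaned = values[~iqr_mask]
```

The IQR bounds come from quartiles, which the planted points barely move. The Z-score uses the mean and standard deviation, both of which the outliers inflate. That difference is why the IQR rule is often the safer default on skewed or contaminated data.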

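The pipeline snippet in the retail sales case study assumes that the date features and the log transform have already been applied. A minimal sketch of those two steps follows. The DataFrame `sales` and its values are made up to mirror the case-study schema, and the `log_` column prefix is a hypothetical naming choice.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the case-study schema (values are made up)
sales = pd.DataFrame({
    'store_id': ['A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2024-01-05', '2024-01-06',
                            '2024-01-07', '2024-01-08']),
    'items_sold': [120.0, 340.0, 95.0, 15.0],
    'avg_price': [9.9, 8.5, 12.0, 11.2],
    'promo_flag': [0, 1, 0, 0],
})

# Date feature expansion
sales['day_of_week'] = sales['date'].dt.dayofweek   # Monday=0 ... Sunday=6
sales['month'] = sales['date'].dt.month
sales['is_weekend'] = sales['day_of_week'] >= 5     # Saturday or Sunday

# Log transform: log1p(x) = log(1 + x), which stays defined at x = 0
for col in ['items_sold', 'avg_price']:
    sales[f'log_{col}'] = np.log1p(sales[col])

print(sales[['date', 'day_of_week', 'is_weekend', 'log_items_sold']])
```

In a real project, the log transform would need to run after imputation, because `np.log1p` propagates `NaN`. It could also be wrapped in `sklearn.preprocessing.FunctionTransformer` to live inside the same Pipeline.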