Chapter 3: Data Preprocessing and Cleaning

Published 2026-03-01 00:13

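# Chapter 3: Data Preprocessing and Cleaning

Data preprocessing is an indispensable step in the data science workflow. It is key to model performance and the foundation of trustworthy business insight. This chapter covers three areas: missing value handling, outlier detection, and data transformation. It provides practical tools and code examples to help readers apply them quickly in real business settings.

---

## 3.1 Missing Value Handling

### 3.1.1 Concepts

- **Missing data**: values that were never stored because of problems during collection or transmission.
- **Types**:
  - *Missing Completely at Random (MCAR)*: the probability of missingness is unrelated to any variable.
  - *Missing at Random (MAR)*: the probability of missingness depends only on observed variables.
  - *Missing Not at Random (MNAR)*: the probability of missingness depends on the unobserved value itself.

### 3.1.2 Impact

1. **Biased statistical inference**: summary statistics such as the mean and standard deviation are distorted by missing data.
2. **Degraded model performance**: many machine learning models cannot accept missing values directly; without handling, training fails or performance suffers.
3. **Distorted decisions**: business reports may reach wrong conclusions because of missing data.

### 3.1.3 Strategies

| Method | When to use | Pros and cons |
|---|---|---|
| **Deletion** (Complete Case / Listwise) | Low missing rate (<5%) | Simple and fast, and no values are fabricated; but it reduces the sample size and can bias results unless the data are MCAR |
| **Single-value imputation** (Mean/Median/Mode) | Moderate missing rate | Easy to implement; but it shrinks variance and can distort feature distributions |
| **Predictive imputation** (KNN, MICE, Regression) | High missing rate or complex data distribution | Keeps the sample size and usually fits more accurately; higher computational cost |
| **Group-based imputation** (Segment-based) | The variable differs by segment | Preserves differences between groups |

### 3.1.4 Example (Python / pandas)

```python
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Load the sample data
df = pd.read_csv('sample_data.csv')

# 1. Drop rows with missing values
clean_df = df.dropna(subset=['target'])  # keep only rows where the target column is present

# 2. Single-value imputation
imputer_mean = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array; flatten it before assigning to one column
df['numeric_col'] = imputer_mean.fit_transform(df[['numeric_col']]).ravel()

# 3. KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
df[['col1', 'col2', 'col3']] = knn_imputer.fit_transform(df[['col1', 'col2', 'col3']])
```

> **Practical reminder**: before imputing, use `df.isnull().sum()` to inspect how much is missing, and decide from the business logic whether the missingness itself should be kept as information. In customer churn prediction, for example, a missing "last transaction time" may indicate low activity.

---

## 3.2 Outlier Detection

### 3.2.1 Definition

- **Outlier**: an observation that lies far from the bulk of the data distribution.
- **Types**:
  - *Point outlier*: a single value deviates from the rest.
  - *Contextual outlier*: a value that is abnormal only in a given context (time, location).
  - *Collective outlier*: a group of observations that is abnormal as a whole.

### 3.2.2 Why It Matters

1. **Overfitting**: a model may learn the outliers too closely, which hurts generalization.
2. **Data quality**: outliers often point to input errors or measurement problems.
3. **Business insight**: some outliers are important business signals (for example financial fraud or equipment failure).

### 3.2.3 Common Methods

| Method | Principle | Suitable for |
|---|---|---|
| **Statistical** (Z-score, IQR) | Thresholds derived from distribution statistics | Numeric, approximately normal data |
| **Distance-based** (Mahalanobis, kNN distance) | Distance from an observation to its neighbors | Low-dimensional data, before clustering |
| **Density-based** (DBSCAN, LOF) | Points in sparse regions are treated as outliers | Non-convex distributions; low to moderate dimensionality |
| **Model-based** (Isolation Forest, One-Class SVM) | Learns a rule that separates outliers from normal points | Large-scale data, non-linear relationships |

### 3.2.4 Example (Python / scikit-learn)

```python
from sklearn.ensemble import IsolationForest
import pandas as pd

# Load the data; .copy() avoids modifying a view of the original frame
X = pd.read_csv('sales_data.csv')[['amount', 'discount']].copy()  # two numeric features

# Isolation Forest
iso = IsolationForest(contamination=0.05, random_state=42)
X['anomaly'] = iso.fit_predict(X)
X['anomaly'] = X['anomaly'].map({1: 0, -1: 1})  # 0: normal, 1: anomaly

# Select the anomalies
anomalies = X[X['anomaly'] == 1]
print('Detected', anomalies.shape[0], 'anomalous rows')
```

> **Practical tip**: before applying distance-based or density-based detectors, normalize or standardize the features so that scale differences do not dominate the distance or density computation. Tree-based detectors such as Isolation Forest are largely insensitive to feature scale.

---

## 3.3 Data Transformation Techniques and Tools

### 3.3.1 Why Transform

1. **Consistency**: a column should use one format throughout (dates, currency, letter case).
2. **Feature engineering**: build new features that help the model (for example log transforms and interaction terms).
3. **Type conversion**: convert strings to categorical or numeric types, or bin numeric values.

### 3.3.2 Common Operations

| Operation | Description | Example |
|---|---|---|
| **Scaling** (Min-Max, Z-score) | Rescales numeric values to a comparable range | `scaled = (x - x.min()) / (x.max() - x.min())` |
| **Discretization** (binning) | Converts a continuous variable into discrete categories | `pd.cut(df['age'], bins=[0,18,35,60,100], labels=['teen','young adult','middle-aged','senior'])` |
| **Feature derivation** | Creates new useful features | `df['total_spent'] = df['price'] * df['quantity']` |
| **Date decomposition** | Splits into year, month, day, week, quarter | `df['year'] = df['date'].dt.year` |
| **Text vectorization** | Converts text into numeric vectors | `TfidfVectorizer`, `CountVectorizer` |
| **Categorical encoding** | One-Hot, Label, Target Encoding | `pd.get_dummies(df['color'])` |

### 3.3.3 Tools and Frameworks

| Tool | Strengths | Scope |
|---|---|---|
| **pandas** | Easy to use, flexible | Small to medium datasets |
| **PySpark** | Distributed computation for big data | Large-scale data, cloud batch jobs |
| **scikit-learn** (Pipeline, ColumnTransformer) | Built-in transformers that chain into a model | Machine learning workflows |
| **Featuretools** | Automated feature engineering | Time series, relational data |
| **Databricks** | Integrates Spark and MLflow | Cloud data lakes, model tracking |

### 3.3.4 Transformation Example (Pipeline)

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample data
X = pd.read_csv('customer_data.csv')

numeric_features = ['age', 'income', 'score']
categorical_features = ['gender', 'region']

numeric_transformer = Pipeline([('scaler', StandardScaler())])
categorical_transformer = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Chain preprocessing with a model (random forest as an example)
clf = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42))
])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X.drop('target', axis=1), X['target'], test_size=0.2, random_state=42)

clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))
```

> **Practical advice**: before building a Pipeline, run **EDA** on every column to confirm whether it is numeric or categorical, so that the wrong encoding is not applied.

---

## 3.4 End-to-End Practice: From Raw Data to Clean Features

### 3.4.1 Preprocessing Workflow

1. **Data collection**: extract from a data lake or warehouse, or from external APIs.
2. **Initial exploration**: use `df.describe()` and `df.info()` to inspect structure and missingness.
3. **Missing value handling**: choose a method that matches the missingness mechanism.
4. **Outlier detection**: remove or flag outliers.
5. **Transformation**: standardize, encode, derive features.
6. **Data splitting**: train / validation / test.
7. **Version control**: track with Data Version Control (DVC) or MLflow.

### 3.4.2 Example: Retail Customer Churn Prediction

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Load the raw data
raw = pd.read_csv('retail_customer.csv')

# 1. Missing value handling and transformation
numeric_features = ['age', 'income', 'purchase_amount']
categorical_features = ['gender', 'membership_level']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# 2. Model
clf = Pipeline([
    ('preprocess', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                         max_depth=3, random_state=42))
])

# 3. Split
X = raw.drop('churn', axis=1)
y = raw['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 4. Train
clf.fit(X_train, y_train)

# 5. Evaluate
pred_proba = clf.predict_proba(X_test)[:, 1]
print('AUC:', roc_auc_score(y_test, pred_proba))
```

> **Key point**: every transformation step in this workflow is wrapped in the Pipeline. The imputers, scaler, and encoder are fitted on the training split only, which keeps **data processing consistent** and applies the same operations to other datasets (for example A/B tests or the production environment).

---

## 3.5 Summary

1. **Missing value handling**: choose deletion, single-value imputation, or predictive imputation according to the missingness mechanism; pandas and scikit-learn make this quick to implement.
2. **Outlier detection**: choose among Z-score, IQR, Isolation Forest, and similar methods according to the data distribution and business needs; an outlier is often noise, but it can also be a business signal.
3. **Data transformation**: standardization, discretization, feature derivation, and categorical encoding are key to model performance; a Pipeline keeps them consistent and reproducible.
4. **Practical workflow**: collection, exploration, cleaning, transformation, and modeling form a repeatable, traceable workflow; combining it with version control and a cloud data lake improves team collaboration.

> **Outlook**: as data volume and variety keep growing, **automated feature engineering** (Featuretools) and **data quality and governance** tools (Great Expectations) are likely to become everyday tools for practitioners. With the methods in this chapter, you can turn messy raw data into clean features that machine learning models can use across a wide range of business scenarios, laying a solid foundation for later modeling and strategic decisions.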
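
The tables in Sections 3.1.3 and 3.2.3 list group-based imputation and the IQR rule, but the chapter gives no code for either. The following is a minimal sketch of both, not a definitive implementation. The function names `impute_by_group` and `flag_iqr_outliers`, the toy DataFrame, and the multiplier `k=1.5` are illustrative assumptions rather than part of the original examples.

```python
import numpy as np
import pandas as pd


def impute_by_group(df, value_col, group_col):
    """Fill missing values in value_col with the median of each group_col segment."""
    group_median = df.groupby(group_col)[value_col].transform("median")
    # Fall back to the overall median if an entire segment is missing.
    return df[value_col].fillna(group_median).fillna(df[value_col].median())


def flag_iqr_outliers(series, k=1.5):
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical toy data: two membership segments with very different spending.
toy = pd.DataFrame({
    "membership_level": ["gold", "gold", "gold",
                         "basic", "basic", "basic", "basic", "basic"],
    "purchase_amount": [900.0, np.nan, 1100.0,
                        100.0, 120.0, np.nan, 110.0, 5000.0],
})

# Impute within each segment first, then flag outliers on the filled column.
toy["purchase_filled"] = impute_by_group(toy, "purchase_amount", "membership_level")
toy["is_outlier"] = flag_iqr_outliers(toy["purchase_filled"])
print(toy)
```

In this toy data the missing "gold" value is filled from the gold segment and the missing "basic" value from the basic segment, so the gap between segments is preserved. Flagging rather than dropping the outlier leaves room to decide later whether it is noise or a business signal.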
