9️⃣ Data Cleaning and Exploratory Analysis (Clean‑EDA)

Published 2026-02-25 05:39

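## 3️⃣ Data Cleaning and Exploration: From Mess to Insight

> **Opening**: Data is the soil that insight grows in, but an unweeded field yields a poor harvest. This section starts with the basics of data cleaning and then uses exploratory analysis to open the way to insight.

### 1️⃣ Start with a data quality assessment

> Before cleaning anything, build an Expectation Suite with **Great Expectations**, and use Pandas for descriptive statistics and data type checks.

```python
import pandas as pd
import great_expectations as ge

df = pd.read_csv('data.csv')

# Check column data types with pandas
print(df.dtypes)

# Wrap the DataFrame as a Great Expectations dataset
# (legacy Pandas dataset API, available in releases before 1.0)
ge_df = ge.from_pandas(df)

# Declare expectations; each call is recorded in the dataset's suite
ge_df.expect_column_values_to_not_be_null('feature')

# Retrieve the accumulated Expectation Suite
expectation_suite = ge_df.get_expectation_suite(discard_failed_expectations=False)

# Validate the data against the suite
results = ge_df.validate(expectation_suite=expectation_suite)
print(results)
```

### 2️⃣ Handling missing values

Unhandled missing values can bias or break downstream models. Use `isnull().sum()` to count them per column, then either fill them in or drop the affected rows.

```python
# Count missing values per column
print(df.isnull().sum())

# Fill missing values with the column median
df['feature'] = df['feature'].fillna(df['feature'].median())
```

### 3️⃣ Outlier detection

Use a box plot or Z-scores to find extreme values, then decide whether to remove or adjust them.

```python
import numpy as np
import seaborn as sns
from scipy import stats

# Inspect the distribution with a box plot
sns.boxplot(x=df['feature'])

# Z-score: flag rows more than 3 standard deviations from the mean
# (zscore may return a NumPy array, so use np.abs rather than .abs())
z_scores = stats.zscore(df['feature'].to_numpy(), nan_policy='omit')
print(df[np.abs(z_scores) > 3])
```

### 4️⃣ Encoding categorical features

- One-hot encoding: `pd.get_dummies()`.
- Label encoding: `sklearn.preprocessing.LabelEncoder`.

### 5️⃣ Time series preprocessing

```python
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Resample to daily frequency, averaging the numeric columns
daily = df.set_index('date').resample('D').mean(numeric_only=True).reset_index()
```

### 6️⃣ EDA (exploratory analysis)

> **Step by step**: start with descriptive statistics, then use visualizations to check distributions and outliers.

```python
import seaborn as sns

print(df.describe())

# Correlate numeric columns only; non-numeric columns raise an error in pandas 2.x
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
```

### 7️⃣ Visualization techniques

- Bar chart: `sns.barplot()`.
- Scatter plot: `sns.scatterplot()`.
- Heatmap: `sns.heatmap()`.
- Interactive charts: `plotly.express`.

```python
import plotly.express as px

# Interactive scatter plot, colored by category
fig = px.scatter(df, x='amount', y='discount', color='category')
fig.show()
```

### 8️⃣ Continuous data quality monitoring

- Run the Expectation Suite on a schedule with an Airflow DAG.
- If the missing rate exceeds a threshold, trigger an alert through a Slack webhook.
- Write the processing results to a metadata store so every run stays traceable.

### 9️⃣ Summary

- Data cleaning is the foundation of data science; if the foundation is unstable, the models built on it fail.
- Automated pipelines (Airflow, dbt, Great Expectations) keep the cleaning process reproducible.
- Exploratory analysis lets us untangle messy data thread by thread and surface business insights.

> **Next step**: In the next chapter, "Modeling and Evaluation", we feed these clean features into a machine learning pipeline and evaluate its predictive performance.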
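As a rough sketch of the missing-rate alert described under continuous monitoring, the check a scheduled task might run could look like the following. The threshold value, the function names, and the toy data are all assumptions made for illustration; the only external fact relied on is that a Slack incoming webhook accepts a JSON body with a `text` field. Nothing is sent over the network here: the example only builds the message.

```python
import json

import pandas as pd

# Hypothetical threshold: alert when more than 20% of a column is missing.
MISSING_RATE_THRESHOLD = 0.2


def columns_over_threshold(df: pd.DataFrame, threshold: float) -> dict:
    """Return {column: missing_rate} for columns whose missing rate exceeds the threshold."""
    rates = df.isnull().mean()  # fraction of missing values per column
    return {col: float(rate) for col, rate in rates.items() if rate > threshold}


def build_slack_payload(offenders: dict) -> str:
    """Build the JSON body for a Slack incoming webhook (an object with a "text" field)."""
    lines = [f"{col}: {rate:.0%} missing" for col, rate in sorted(offenders.items())]
    return json.dumps({"text": "Data quality alert\n" + "\n".join(lines)})


# Toy data standing in for a real batch (illustrative values only)
sample = pd.DataFrame({
    "amount": [10.0, None, 30.0, None],  # half of the values are missing
    "discount": [0.1, 0.2, 0.3, 0.4],    # no missing values
})

offenders = columns_over_threshold(sample, MISSING_RATE_THRESHOLD)
payload = build_slack_payload(offenders) if offenders else None
if payload is not None:
    # In a scheduled task, this body would be POSTed to the webhook URL.
    print(payload)
```

Keeping the check and the message construction as separate pure functions makes them easy to unit test before wiring them into an Airflow task; the actual HTTP call and the webhook URL are left out on purpose.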
