3. Data Exploration and Visualization

Published 2026-03-03 00:59

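> **Exploratory data analysis (EDA) is the critical first step of the data science workflow.** It gives you a quick picture of the data's structure, distributions, and relationships, and it exposes potential problems before you choose a model, which lowers the risk of later training and deployment.

---

## 3.1 Why Do We Need Exploratory Analysis?

| Purpose | Impact | Concrete example |
|------|------|----------|
| **Locate quality problems quickly** | Surfaces missing values, outliers, and duplicate rows early | In a bank-loan dataset, 15% of the loan amounts turn out to be missing, so the values are imputed before modeling |
| **Reveal business insights** | Descriptive statistics or association analysis yield key indicators | A box plot shows that credit card usage frequency is positively associated with late repayment |
| **Guide model selection** | The properties of the features give clues for modeling decisions | Skewed feature distributions lead to choosing a tree-based model over linear regression |
| **Bridge communication** | Visuals explain the data's characteristics and build cross-team consensus | A correlation heatmap is handed to the marketing team to explain key behavior patterns |

---

## 3.2 The Four Stages of EDA

1. **Data loading and basic structure checks**
   - `shape`, `info()`, `describe()`
   - Column types and missing-value counts
2. **Univariate analysis**
   - Numeric: distribution, mean, median, quartiles
   - Categorical: frequencies, unique values
3. **Bivariate and multivariate association analysis**
   - Correlation coefficients, chi-square tests, preliminary K-means clustering
4. **Time series and grouped analysis**
   - Trend, seasonality, lag features

---

## 3.3 Typical Descriptive Statistics Tools

| Function | Common call | Typical output |
|------|----------|----------|
| **Descriptive statistics** | `df.describe()` | count, mean, std, min, 25%, 50%, 75%, max |
| **Missing-value counts** | `df.isna().sum()` | Number of missing values per column |
| **Data types** | `df.dtypes` | Integer, float, category, datetime |
| **Category distribution** | `df['col'].value_counts()` | Frequency of each category |

---

## 3.4 Visualization Tools in Practice

### 3.4.1 Basic Charts

| Chart | When to use | Typical parameters |
|------|----------|----------|
| **Histogram** (`plt.hist`) | Distribution of a numeric variable | `bins`, `density`, `color` |
| **Box plot** (`sns.boxplot`) | Outlier detection | `x`, `y`, `whis` |
| **Kernel density plot** (`sns.kdeplot`) | Smoothed distribution | `fill` (called `shade` in older seaborn releases), `bw_adjust` |
| **Bar chart** (`sns.countplot`) | Category frequencies | `x`, `palette` |

### 3.4.2 Advanced Charts

| Chart | When to use | Typical parameters |
|------|----------|----------|
| **Scatter plot** (`sns.scatterplot`) | Relationship between two variables | `x`, `y`, `hue`, `size` |
| **Correlation heatmap** (`sns.heatmap`) | Matrix of numeric correlations | `annot`, `cmap`, `mask` |
| **Pair plot** (`sns.pairplot`) | Overview of all pairwise relationships | `hue`, `kind` |
| **Time series line chart** (`plt.plot`) | Trend over time | `x`, `y`, `label` |
| **Area chart** (`plt.stackplot`) | Cumulative trend by category | `labels`, `baseline` |

### 3.4.3 Interactive Visualization

| Tool | Strength |
|------|------|
| **Plotly** | Mouse interaction (hover, zoom) and dropdown menus |
| **Bokeh** | Streaming visualization of large datasets |
| **Altair** | Concise declarative syntax |

---

## 3.5 Case Study: Credit Card Default Prediction

> **Dataset**: `credit_card_default.csv` (about 30,000 rows, 17 columns)

### 3.5.1 Loading the Data and First Checks

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('credit_card_default.csv')
print(df.shape)   # (rows, columns)
df.info()         # prints its summary directly and returns None, so no print()
print(df.head())
```

### 3.5.2 Missing Values and Type Adjustments

```python
# Count missing values per column
print(df.isna().sum())

# Convert the date column to datetime, if the dataset has one
if 'PAY_DATE' in df.columns:
    df['PAY_DATE'] = pd.to_datetime(df['PAY_DATE'])
```

### 3.5.3 Univariate Analysis

```python
# Distribution of each numeric column
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numeric_cols:
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
    plt.show()

# Frequency of each categorical column
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    sns.countplot(data=df, x=col)
    plt.title(f'Frequency of {col}')
    plt.xticks(rotation=45)
    plt.show()
```

### 3.5.4 Bivariate Associations

```python
# Correlation matrix of the numeric columns (numeric_cols comes from 3.5.3)
corr = df[numeric_cols].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()
```

### 3.5.5 Grouped Statistics

```python
# Average credit limit (LIMIT_BAL) for defaulters vs. non-defaulters
group = df.groupby('default_payment_next_month')['LIMIT_BAL'].mean()
print(group)
```

---

## 3.6 Advanced Techniques

| Technique | Purpose | Example |
|------|------|------|
| **Quantile binning** | Discretize a continuous feature and improve interpretability | `pd.qcut(df['BALANCE'], q=5, labels=False)` |
| **Automated reports** | Version control and reproducibility | `ydata_profiling.ProfileReport(df)` (the package was formerly named `pandas_profiling`) |
| **Class weighting** | Compensate for class imbalance | `class_weight='balanced'` in sklearn models |
| **Time-based features** | Capture seasonality | `df['month'] = df['date'].dt.month` |

---

## 3.7 Common Pitfalls and Fixes

| Pitfall | Possible consequence | How to check | Suggested fix |
|------|----------|----------|----------|
| **Misjudged outliers** | Degraded model predictions | `boxplot` + `z-score` | Visualize first, then decide whether to correct |
| **Data leakage** | Overly optimistic evaluation that hides overfitting | Confirm that no statistic or transform was fitted on test data | Split the data first, then fit all preprocessing on the training set only |
| **Sampling bias** | Results do not generalize | Compare the sample's distribution (e.g. `value_counts(normalize=True)`) against the target population | Resample or apply weights |
| **Misread visuals** | Misleading decisions | Verify statistical significance first | Treat charts as supporting evidence only |

---

## 3.8 Summary

- **Exploratory analysis** is the bridge that turns raw data into business insights and modeling hypotheses.
- Descriptive statistics, correlation analysis, time series checks, and visualization should be the standard routine of every data exploration.
- A good command of visualization tools and practical techniques can shorten preprocessing time and improve model quality.

> **Further reading**:
> - *Python for Data Analysis* (Wes McKinney)
> - *Storytelling with Data* (Cole Nussbaumer Knaflic)
> - *Practical Statistics for Data Scientists* (Peter Bruce, Andrew Bruce)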
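
As a follow-up to the outlier check listed in Section 3.7 (box plot plus z-score), the two rules can be compared side by side in code. This is a minimal sketch rather than the article's own method: the helper name `flag_outliers`, the sample credit limits, and the thresholds (whiskers at 1.5 times the IQR, absolute z-score above 3) are all assumptions chosen for illustration.

```python
import pandas as pd


def flag_outliers(values: pd.Series, whis: float = 1.5, z_thresh: float = 3.0) -> pd.DataFrame:
    """Flag outliers with the box plot (IQR) rule and with z-scores.

    Hypothetical helper for illustration; the thresholds are conventional
    defaults, not values taken from the article.
    """
    # Box plot rule: anything beyond whis * IQR from the quartiles is flagged
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    iqr_flag = (values < lower) | (values > upper)

    # Z-score rule: distance from the mean in (population) standard deviations
    z = (values - values.mean()) / values.std(ddof=0)
    z_flag = z.abs() > z_thresh

    return pd.DataFrame({"value": values, "iqr_outlier": iqr_flag, "z_outlier": z_flag})


# Made-up credit limits; the last value is deliberately extreme
limits = pd.Series([20_000, 30_000, 50_000, 50_000, 80_000, 90_000, 120_000, 1_000_000])
flags = flag_outliers(limits)
print(flags)
```

In this toy sample the IQR rule flags the 1,000,000 value while the z-score rule does not, because the extreme value inflates the standard deviation it is measured against. Disagreements like this are one reason to plot the data before deciding whether a point should be corrected.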
