Data Science Demystified: A Pragmatic Guide for Business Decision-Makers - Chapter 3
Published 2026-02-23 09:24
# Chapter 3: Exploratory Data Analysis (EDA)
## 3.1 Why EDA Matters in Business
- **Discover hidden patterns** that can turn into actionable insights.
- **Validate assumptions** before building models (e.g., linearity, homoscedasticity).
- **Identify data quality issues** such as outliers, missing values, or skewed distributions.
- **Guide feature engineering** and model selection by revealing variable importance.
> *Business leaders often ask, “What can we learn from this data?” EDA is the tool that turns raw tables into evidence‑backed narratives.*
## 3.2 Core Steps of an EDA Pipeline
| Step | Goal | Typical Python Tools |
|------|------|---------------------|
| 1. **Load & Inspect** | Verify schema, data types, and head rows | `pandas.read_csv`, `pandas.DataFrame.info()` |
| 2. **Summarise** | Statistical overview (mean, std, percentiles) | `pandas.DataFrame.describe()`, `scipy.stats` |
| 3. **Visualise** | Understand distributions & relationships | `seaborn`, `matplotlib`, `plotly` |
| 4. **Transform** | Handle outliers, missingness, scaling | `scikit‑learn` preprocessing, custom functions |
| 5. **Reduce Dimensionality** | Simplify high‑dimensional data | `sklearn.decomposition.PCA`, `sklearn.manifold.TSNE`, `umap-learn` |
| 6. **Document** | Keep reproducible notebooks, version‑control scripts | Git, Jupyter Notebooks, VSCode Live Share |
### 3.2.1 Example: Load & Inspect
```python
import pandas as pd
df = pd.read_csv('sales_data.csv')
df.info()  # info() prints directly and returns None, so no print() needed
print(df.head())
```
### 3.2.2 Example: Summary Statistics
```python
summary = df.describe().T
print(summary[['mean','std','min','25%','50%','75%','max']])
```
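Step 4 of the pipeline (Transform) deserves a concrete example too. The sketch below imputes missing values with the median and clips outliers to the 1.5 × IQR fences; the tiny `revenue` frame is synthetic stand-in data, not the chapter's `sales_data.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real sales column (one missing value, one outlier)
df = pd.DataFrame({'revenue': [100, 120, np.nan, 95, 110, 5000]})

# Impute missing values with the median -- robust to the outlier
df['revenue'] = df['revenue'].fillna(df['revenue'].median())

# Clip values outside the 1.5 * IQR fences
q1, q3 = df['revenue'].quantile([0.25, 0.75])
iqr = q3 - q1
df['revenue'] = df['revenue'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df['revenue'].tolist())
```

The median is preferred over the mean here because the mean itself would be dragged upward by the very outlier we are about to clip.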
## 3.3 Visualization Strategies
| Category | Purpose | Recommended Plot | Library |
|----------|---------|------------------|---------|
| **Univariate** | Show distribution of a single variable | Histogram, KDE, Box plot | `seaborn`, `matplotlib` |
| **Bivariate** | Explore relationship between two variables | Scatter, Line, Correlation heatmap | `seaborn`, `plotly` |
| **Multivariate** | Visualise interactions in higher dimensions | Parallel Coordinates, Pair Plot | `seaborn`, `plotly.express` |
| **Temporal** | Detect seasonality or trends | Line plot, Rolling stats | `matplotlib`, `pandas.plotting` |
### 3.3.1 Univariate Example
```python
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(df['revenue'], kde=True, bins=30)
plt.title('Revenue Distribution')
plt.xlabel('Revenue ($)')
plt.ylabel('Frequency')
plt.show()
```
### 3.3.2 Bivariate Example
```python
sns.scatterplot(data=df, x='advertising_spend', y='revenue', hue='region')
plt.title('Revenue vs Advertising Spend by Region')
plt.show()
```
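The table above also lists a temporal category. As a minimal sketch, assuming a date-indexed sales series (synthetic here), a rolling mean is often the quickest way to expose a trend beneath daily noise:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily sales: a gentle upward trend plus noise
dates = pd.date_range('2025-01-01', periods=90, freq='D')
rng = np.random.default_rng(0)
sales = pd.Series(100 + 0.5 * np.arange(90) + rng.normal(0, 5, 90), index=dates)

# A 7-day rolling mean smooths daily noise and exposes the trend
rolling = sales.rolling(window=7).mean()

sales.plot(alpha=0.4, label='daily sales')
rolling.plot(label='7-day rolling mean')
plt.legend()
plt.title('Daily Sales with Rolling Mean')
plt.show()
```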
## 3.4 Statistical Summaries & Diagnostic Tests
| Metric | Interpretation | Code Snippet |
|--------|----------------|--------------|
| **Correlation** | Strength & direction of linear relationship | `df.corr(numeric_only=True)` |
| **Skewness / Kurtosis** | Deviation from normality | `scipy.stats.skew(df['col'])`, `scipy.stats.kurtosis(df['col'])` |
| **Shapiro–Wilk Test** | Test normality | `scipy.stats.shapiro(df['col'])` |
| **Kolmogorov–Smirnov Test** | Compare distributions | `scipy.stats.ks_2samp(df['col1'], df['col2'])` |
Example: Check normality of `age` column
```python
from scipy import stats

stat, p = stats.shapiro(df['age'])
print(f'Shapiro-Wilk p-value: {p:.4f}')
if p < 0.05:
    print('Data likely not normal.')
else:
    print('Cannot reject normality.')
```
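The correlation row of the table can be checked the same way. The sketch below builds a synthetic frame (the column names are illustrative, not from `sales_data.csv`) and flags variable pairs whose absolute correlation exceeds 0.8:

```python
import numpy as np
import pandas as pd

# Synthetic frame: 'revenue' is built to track 'advertising_spend' closely
rng = np.random.default_rng(42)
spend = rng.normal(size=200)
df = pd.DataFrame({
    'advertising_spend': spend,
    'revenue': 2 * spend + rng.normal(0, 0.1, 200),
    'headcount': rng.normal(size=200),
})

corr = df.corr(numeric_only=True)
print(corr.round(2))

# Flag variable pairs whose absolute correlation exceeds 0.8
strong = [(a, b) for a in corr.columns for b in corr.columns
          if a < b and abs(corr.loc[a, b]) > 0.8]
print(strong)
```

Such a list of strongly correlated pairs feeds directly into the multicollinearity discussion in Section 3.6.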
## 3.5 Dimensionality Reduction Techniques
| Technique | When to Use | Typical Implementation |
|-----------|-------------|------------------------|
| **PCA** | Linear relationships, large feature sets | `sklearn.decomposition.PCA` |
| **t‑SNE** | Non‑linear, visualise clusters | `sklearn.manifold.TSNE` |
| **UMAP** | Preserve global structure, faster than t‑SNE | `umap-learn` |
### 3.5.1 PCA Example
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

features = df[['feature1','feature2','feature3','feature4']]
scaled = StandardScaler().fit_transform(features)  # PCA is scale-sensitive

pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled)
print('Explained variance ratio:', pca.explained_variance_ratio_)

plt.figure(figsize=(8,6))
plt.scatter(principal_components[:,0], principal_components[:,1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Product Features')
plt.show()
```
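t‑SNE from the table above follows the same fit/transform pattern. A minimal sketch on two synthetic clusters (note that `perplexity` must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

# Two well-separated synthetic clusters standing in for real feature vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(8, 1, (50, 5))])

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)  # (100, 2)
```

Unlike PCA, t‑SNE has no `transform` for new data, so it is a visualisation tool rather than a reusable preprocessing step.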
## 3.6 EDA in the Context of Model Selection
| Data Insight | Model Implication |
|---------------|-------------------|
| **Linear relationship** | Consider linear regression, Lasso, Ridge |
| **Strong multicollinearity** | Use regularisation or dimensionality reduction |
| **Non‑linear patterns** | Tree‑based methods (Random Forest, XGBoost) or neural nets |
| **Class imbalance** | Imbalanced‑classification techniques (SMOTE, class weights) |
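To illustrate the class-imbalance row, the sketch below compares logistic regression with and without `class_weight='balanced'` on a synthetic 9:1 problem; the data and the 0.9/0.1 split are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic 9:1 imbalanced binary problem
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X, y)

# Balanced class weights typically raise recall on the minority class
print('plain   :', recall_score(y, plain.predict(X)))
print('weighted:', recall_score(y, weighted.predict(X)))
```

The usual trade-off is that the weighted model gains minority-class recall at some cost in precision, which EDA of the class distribution helps anticipate.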
**Case Study – Retail Forecasting**
1. **EDA found** a monthly seasonal trend and a weak linear trend in sales.
2. **Chosen model**: SARIMA (Seasonal ARIMA) because it captures both trend and seasonality.
3. **Result**: 12% reduction in forecast error versus a naïve rolling‑average model.
## 3.7 Reproducible EDA Practices
| Practice | Rationale | Tool |
|----------|-----------|------|
| **Notebook version control** | Audit trail of analysis | Git, Jupyter Notebook extensions |
| **Seed management** | Consistent random splits | `numpy.random.seed()`, `random_state=` parameters in scikit-learn |
| **Automated reports** | Stakeholder communication | `nbconvert`, `papermill`, `DataDog` dashboards |
| **Data lineage tracking** | Trace back from insights to raw files | `great_expectations`, `dbt` |
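The seed-management row can be made concrete. A minimal sketch, assuming scikit-learn's `train_test_split`: fixing `random_state` yields identical splits on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state yields the same split on every run
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(np.array_equal(a_train, b_train))  # True
```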
```bash
# Example: Commit notebook after each EDA session
git add sales_eda.ipynb
git commit -m "Add initial EDA with outlier handling"
```
## 3.8 Interactive and Production‑Ready EDA
- **Dashboards**: `Plotly Dash`, `Streamlit` for real‑time exploration.
- **Automated EDA tools**: `ydata-profiling` (formerly `pandas-profiling`) and `sweetviz` generate full reports with one line.
- **Integration with ML pipelines**: Store EDA artifacts (plots, summary tables) as model artifacts in MLflow or DVC.
```python
import sweetviz as sv
report = sv.analyze(df)
report.show_html('eda_report.html')
```
## 3.9 Take‑Away Checklist for Business Leaders
- **Ask**: What are the business questions we want to answer? EDA should align with them.
- **Validate**: Are the data patterns stable over time? Re‑run EDA periodically.
- **Document**: Keep a versioned record of the EDA notebook and key findings.
- **Communicate**: Translate plots into business‑relevant narratives.
- **Iterate**: Use insights to refine data collection and feature engineering.
---
*In the next chapter, we’ll translate the insights from this EDA into robust, supervised learning models.*