Data Science for the Modern Analyst: From Data to Insight - Chapter 3
Published 2026-03-04 13:50
# Chapter 3: Turning Clean Data into Insight—Exploratory Data Analysis
After a rigorous cleaning pipeline, the dataset is ready for interrogation. This chapter equips you with the toolkit to uncover patterns, validate assumptions, and surface hypotheses that will drive the modeling phase. We’ll blend statistical rigor with hands‑on Python, ensuring every step is reproducible and auditable.
---
## 3.1 Overview
| Section | Goal | Key Tools |
|---------|------|-----------|
| 3.2 | Compute summary statistics | `DataFrame.describe()` |
| 3.3 | Visualize distributions | Seaborn, Matplotlib |
| 3.4 | Explore relationships | Pairplot, heatmap |
| 3.4.2 | Assess multicollinearity | Correlation matrix, VIF |
| 3.5 | Detect anomalies | Boxplots, Isolation Forest |
| 3.7 | Document findings | Jupyter notebook, Markdown |
Each subsection below details the process, complete with code snippets that you can copy‑paste and adapt.
---
## 3.2 Summary Statistics: The First Glimpse
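The classic `DataFrame.describe()` gives a quick snapshot of central tendency and spread: count, mean, standard deviation, minimum, quartiles, and maximum for numeric columns, plus count, unique, top, and freq for categorical columns when `include="all"` is passed.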
```python
import pandas as pd
df = pd.read_csv("/data/cleaned/customer_transactions.csv")
print(df.describe(include="all"))  # "all" must be a string, not a list
```
### Key Takeaways
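1. **Skewness & Kurtosis**: `describe()` does not report these directly. A mean far from the median (the `50%` row) is a first hint; follow up with `df.skew(numeric_only=True)` and `df.kurt(numeric_only=True)`. An absolute skewness above 1 signals a long tail. pandas reports *excess* kurtosis, so values above 0 (equivalent to kurtosis above 3) indicate heavier tails than a normal distribution.
2. **Missingness Patterns**: The `count` row shows non-null values per column, so any count below `len(df)` means missing values remain. Use `df.isna().sum()` for exact counts and revisit imputation if necessary.
3. **Data Types**: Check `df.dtypes` to confirm that categorical columns are `object` or `category` and numeric columns are `float64`/`int64`.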
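
The three checks above can be gathered into one table. The following is a minimal sketch, not part of the chapter's pipeline: `profile_columns` is a hypothetical helper name, and the toy DataFrame is made up purely for illustration.

```python
import numpy as np
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, share of missing values, skewness, and excess kurtosis."""
    numeric = df.select_dtypes(include="number")
    profile = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
    })
    # skew() and kurt() skip NaN by default; kurt() returns *excess* kurtosis.
    # Non-numeric columns are left as NaN because assignment aligns on the index.
    profile["skew"] = numeric.skew()
    profile["excess_kurtosis"] = numeric.kurt()
    return profile


# Toy data for illustration only (not the chapter's dataset).
toy = pd.DataFrame({
    "purchase_amount": [10.0, 12.0, 11.0, 13.0, 250.0, np.nan],
    "age": [23, 35, 41, 29, 52, 38],
    "customer_segment": ["a", "b", "a", "b", "a", "b"],
})
profile = profile_columns(toy)
print(profile)
```

On real data, sorting this table by `missing_share` or by absolute `skew` gives a quick list of columns to inspect first.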
---
## 3.3 Univariate Analysis: Digging Into One Variable at a Time
Univariate analysis uncovers distributional properties and outlier presence.
### 3.3.1 Histograms & Density Plots
```python
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
sns.histplot(df["purchase_amount"], kde=True, bins=30, color="steelblue")
plt.title("Purchase Amount Distribution")
plt.xlabel("$Amount")
plt.ylabel("Frequency")
plt.show()
```
### 3.3.2 Boxplots
```python
plt.figure(figsize=(8, 6))
sns.boxplot(x=df["age"], color="lightgreen")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.show()
```
**Interpretation**:
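- **Outliers**: Points below Q1 − 1.5×IQR or above Q3 + 1.5×IQR (the limits of the boxplot whiskers) may warrant investigation.
- **Symmetry**: A symmetric histogram is consistent with normality but does not prove it. For right-skewed, non-negative variables such as `purchase_amount`, consider a log transformation.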
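
To make the 1.5×IQR rule concrete, here is a small sketch that returns the values a boxplot would draw as individual points. The helper name `iqr_outliers` and the sample ages are assumptions for illustration.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Made-up ages; 95 sits far above the upper fence.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 95], name="age")
flagged = iqr_outliers(ages)
print(flagged)
```

Raising `k` (3.0 is a common choice for "extreme" outliers) makes the rule more conservative.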
---
## 3.4 Bivariate & Multivariate Exploration
When a single variable is insufficient, we turn to pairwise relationships.
### 3.4.1 Scatter Plots & Pairplot
```python
sns.pairplot(df, vars=["age", "income", "purchase_amount"], hue="customer_segment")
plt.show()
```
### 3.4.2 Correlation Heatmap
```python
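# Restrict to numeric columns; pandas 2.x raises if corr() meets non-numeric data
corr = df.select_dtypes(include="number").corr(method="pearson")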
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.show()
```
**Key Concepts**:
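- **Multicollinearity**: Absolute correlations above roughly 0.8 between predictors can inflate the variance of coefficient estimates; confirm with the Variance Inflation Factor (VIF), which also catches collinearity involving more than two variables.
- **Feature Importance Insight**: A strong positive correlation between `income` and `purchase_amount` suggests a linear model is a reasonable first candidate, though correlation alone does not establish a causal relationship.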
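
The VIF of predictor j equals 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing that predictor on all the others; equivalently, it is the j-th diagonal entry of the inverse correlation matrix. The sketch below uses that identity with NumPy only. `vif_table` is a hypothetical helper, the data is synthetic, and the code assumes complete (non-missing), non-constant numeric columns.

```python
import numpy as np
import pandas as pd


def vif_table(predictors: pd.DataFrame) -> pd.Series:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = predictors.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=predictors.columns, name="VIF")


# Synthetic predictors: income is built from age, tenure is independent.
rng = np.random.default_rng(0)
age = rng.normal(40, 10, 500)
income = 1000 * age + rng.normal(0, 2000, 500)
tenure = rng.normal(5, 2, 500)

vifs = vif_table(pd.DataFrame({"age": age, "income": income, "tenure": tenure}))
print(vifs.round(2))
```

Common rules of thumb treat a VIF above 5 or 10 as a sign of problematic collinearity; here `age` and `income` exceed that while `tenure` stays near 1.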
---
## 3.5 Detecting Anomalies & Outliers
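Beyond visual checks, algorithmic detectors apply a consistent, repeatable rule for flagging unusual rows. Isolation Forest, used here, isolates points with random splits; its `contamination` parameter is the share of rows you expect to be anomalous, so `0.01` is an assumption to tune rather than a derived threshold.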
```python
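from sklearn.ensemble import IsolationForest
# Fit on numeric features only; impute or drop missing values first
features = df.select_dtypes(include="number")
clf = IsolationForest(contamination=0.01, random_state=42)
# fit_predict returns labels, not scores: -1 for anomalies, 1 for inliers
df["anomaly_label"] = clf.fit_predict(features)
anomalies = df[df["anomaly_label"] == -1]
print(f"Detected {len(anomalies)} anomalies.")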
```
**Why This Matters**:
- Anomalies can distort models and lead to misleading insights.
- They often reveal data entry errors, fraud, or rare but significant events.
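
For a single column, a lightweight cross-check is the modified z-score, which replaces mean and standard deviation with the median and the median absolute deviation (MAD) so that the outliers themselves do not distort the scale. This is a sketch of a different, simpler technique than Isolation Forest, not a substitute for it; the cutoff of 3.5 is a commonly cited convention and the amounts are made up.

```python
import numpy as np


def modified_z_scores(values) -> np.ndarray:
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return 0.6745 * (x - median) / mad


# Made-up purchase amounts with one extreme value.
amounts = [12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 400.0]
scores = modified_z_scores(amounts)
outlier_mask = np.abs(scores) > 3.5
print(np.asarray(amounts)[outlier_mask])
```

Rows flagged by both this check and the multivariate detector are usually the first ones worth inspecting by hand.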
---
## 3.6 Feature Engineering Insights from EDA
EDA is not just a diagnostic tool; it seeds feature creation.
| Insight | Potential Feature | Rationale |
|---------|-------------------|-----------|
| Skewed `purchase_amount` | Log‑transformed `log_purchase_amount` | Stabilizes variance |
| Effect of `income` on `purchase_amount` differs across `age` | Interaction `age * income` | Captures compound effects |
| Periodic spikes in `transaction_date` | Extracted `month_of_year` | Models seasonality |
```python
import numpy as np
df['log_purchase_amount'] = np.log1p(df['purchase_amount'])  # log(1 + x) keeps zeros defined
```
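
The other two rows of the table can be sketched the same way. The column names follow the table; the output name `age_x_income` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "age": [25, 40, 58],
    "income": [30_000, 72_000, 55_000],
    "transaction_date": ["2025-01-15", "2025-07-04", "2025-12-24"],
})

features = toy.assign(
    # Interaction term: lets a model capture an income effect that varies with age
    age_x_income=toy["age"] * toy["income"],
    # Calendar month (1-12) extracted from the parsed date, for seasonality
    month_of_year=pd.to_datetime(toy["transaction_date"]).dt.month,
)
print(features[["age_x_income", "month_of_year"]])
```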
---
## 3.7 Reproducibility in Exploratory Work
*Documentation is key.* Wrap plots in functions, capture figure metadata, and version your notebooks.
```python
# utils/eda_plot.py
import matplotlib.pyplot as plt


def histogram(series, title, xlabel, ylabel, bins=30):
    """Draw a histogram of `series` and return the Figure so callers can save it."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.hist(series, bins=bins, color="steelblue", alpha=0.7)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    fig.tight_layout()
    return fig
```
Call from notebook:
```python
from utils.eda_plot import histogram
fig = histogram(df["purchase_amount"], "Purchase Amount", "$Amount", "Frequency")
fig.savefig("/reports/figs/purchase_amount_hist.png", dpi=300)
```
Versioning tools like DVC can lock the exact snapshot of the dataset used in this EDA.
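
One way to capture figure metadata, sketched here with the standard library only, is a JSON sidecar file written next to each figure that records the dataset's hash and the plot parameters. The function names, the metadata fields, and the sidecar naming scheme are all assumptions for illustration, not a DVC feature.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_figure_metadata(figure_path: Path, dataset_path: Path, params: dict) -> Path:
    """Write <figure>.json next to the figure and return its path."""
    metadata = {
        "figure": figure_path.name,
        "dataset": dataset_path.name,
        "dataset_sha256": file_sha256(dataset_path),
        "params": params,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = figure_path.with_suffix(".json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar


# Demo against a throwaway file rather than the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "toy.csv"
    data_file.write_text("purchase_amount\n10\n20\n")
    sidecar = write_figure_metadata(
        Path(tmp) / "purchase_amount_hist.png", data_file, {"bins": 30}
    )
    saved = json.loads(sidecar.read_text())
    print(saved["dataset_sha256"][:12])
```

If the dataset changes, the stored hash no longer matches, which signals that the figure needs regenerating.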
---
## 3.8 Ethical Lens: Bias and Fairness in EDA
When exploring demographic columns (`age`, `gender`, `region`), be vigilant for disparate patterns.
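- **Segment Disparities**: Compare `purchase_amount` across `customer_segment` to spot systematic differences, and check group sizes (for example with `df["customer_segment"].value_counts()`) to see whether any group is under-represented.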
- **Missingness Bias**: Columns with high missing rates for a particular demographic may signal sampling bias.
```python
sns.boxplot(x="customer_segment", y="purchase_amount", data=df)
plt.title("Purchase Amount by Segment")
plt.show()
```
If bias is detected, document it and consider adjusting downstream modeling pipelines, for example by reweighting or stratified sampling and by evaluating model performance per group.
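
Group sizes and missingness by group can be audited in a single table. The sketch below assumes a `region` column and an `income` column with gaps; both the data and the output column names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south", "south", "east"],
    "income": [52_000, np.nan, 61_000, 48_000, 50_000, np.nan, np.nan, 45_000],
})

audit = (
    toy.assign(income_missing=toy["income"].isna())
    .groupby("region")
    .agg(
        n_rows=("income_missing", "size"),        # rows per group
        missing_rate=("income_missing", "mean"),  # share of rows with missing income
    )
)
audit["share_of_rows"] = audit["n_rows"] / len(toy)
print(audit)
```

A group with a small `share_of_rows` or a much higher `missing_rate` than the others is a candidate for the documentation step described above.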
---
## 3.9 Checklist: Do You Have It All?
- [ ] Summary statistics computed and reviewed.
- [ ] Univariate plots generated for each numeric column.
- [ ] Bivariate relationships mapped (scatter / heatmap).
- [ ] Outliers flagged (visual and algorithmic).
- [ ] Feature engineering hypotheses documented.
- [ ] Plots saved with metadata and reproducible code.
- [ ] Ethical review of demographic patterns completed.
Completing this checklist ensures that the dataset is not just clean but *understood*—the essential precursor to robust modeling.
---
## 3.10 Take‑Away
Exploratory Data Analysis turns raw numbers into a story. It validates assumptions, surfaces anomalies, and sparks feature ideas. By embedding reproducibility and ethical scrutiny into each step, you set a firm foundation for the next chapter: building predictive models that are accurate, fair, and actionable.