Data Science Unveiled: From Raw Data to Insightful Decisions - Chapter 4
Published on 2026-03-06 20:30
# Chapter 4: Exploratory Data Analysis (EDA)
## 4.1 Why EDA Matters
Exploratory Data Analysis is the *first* investigative step after cleaning and before modeling. It turns raw numbers into visual stories and helps you:
- **Discover** hidden patterns, relationships, and distributions.
- **Validate** assumptions (e.g., normality, independence).
- **Spot** anomalies, outliers, and missing‑value regimes.
- **Generate** hypotheses that guide feature engineering and model choice.
- **Communicate** insights to stakeholders with reproducible visuals.
In short, EDA is both a science and an art: you rely on statistical tests, but you also read the *look* of your data.
---
## 4.2 Core Concepts & Terminology
| Term | Definition |
|------|------------|
| **Univariate** | Analysis of a single variable (e.g., histogram of age). |
| **Bivariate** | Joint analysis of two variables (e.g., scatter plot of income vs. spending). |
| **Multivariate** | Analysis involving three or more variables (e.g., pairplot or heatmap). |
| **Skewness** | Measure of asymmetry in a distribution. |
| **Kurtosis** | Measure of tail-heaviness relative to a normal distribution; high values mean extreme observations are more likely. |
| **Correlation** | Strength and direction of linear relationship (Pearson) or monotonic relationship (Spearman). |
| **Covariance** | Unstandardized measure of joint variability. |
| **Missing Completely At Random (MCAR)** | Probability of missingness is independent of both observed and unobserved data. |
| **Missing At Random (MAR)** | Missingness depends on observed data but not on unobserved data. |
| **Missing Not At Random (MNAR)** | Missingness depends on unobserved data, including the missing values themselves. |
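
To make a few of the terms above concrete, here is a minimal sketch that computes skewness, kurtosis, covariance, and both correlation flavors with pandas. The data is synthetic, and the column names `income` and `spending` are illustrative assumptions rather than part of any dataset used in this chapter.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic, right-skewed "income" and a monotonic but non-linear "spending"
income = rng.lognormal(mean=10, sigma=0.6, size=1_000)
spending = np.sqrt(income) + rng.normal(0, 5, size=1_000)
demo = pd.DataFrame({"income": income, "spending": spending})

skew = demo["income"].skew()       # > 0 means a longer right tail
kurt = demo["income"].kurtosis()   # excess kurtosis; about 0 for normal data
cov = demo["income"].cov(demo["spending"])  # depends on the units of both columns

# Pearson measures linear association, Spearman measures monotonic association
pearson = demo.corr(method="pearson").loc["income", "spending"]
spearman = demo.corr(method="spearman").loc["income", "spending"]

print(f"skew={skew:.2f}, kurtosis={kurt:.2f}, cov={cov:.1f}")
print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

Because the relationship here is monotonic but curved, comparing the two correlation values is a quick way to notice non-linearity before plotting.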
---
## 4.3 Toolset Overview
| Library | Primary Use | Strength |
|---------|-------------|----------|
| **Matplotlib** | Base plotting, fine‑grained control | Mature, highly customizable |
| **Seaborn** | Statistical visualizations built on Matplotlib | Built‑in themes, concise syntax |
| **Plotly** | Interactive, web‑ready plots | Hover info, 3‑D plots, easy sharing |
| **Missingno** | Visualizing missing‑value patterns | Quick heatmap, bar, matrix |
| **Pandas** | Data handling & aggregation | Powerful groupby & descriptive stats |
---
## 4.4 Visual Exploration
### 4.4.1 Univariate Analysis
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('titanic.csv')
# Histogram & KDE
sns.histplot(df['Age'].dropna(), kde=True, color='steelblue')
plt.title('Age Distribution')
plt.show()
```
**Takeaway:** Inspect shape, center, and spread. If the histogram is heavily skewed, consider transformation.
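
As a rough illustration of the takeaway above, the sketch below applies a log transform to a right-skewed variable and compares skewness before and after. The series is synthetic (a hypothetical stand-in for something like ticket prices), and `np.log1p` is only one of several possible transformations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed variable; this is not the real Titanic data
fare = pd.Series(rng.lognormal(mean=3, sigma=1, size=2_000), name="fare")

# log1p computes log(1 + x), which stays defined when x == 0
fare_log = np.log1p(fare)

skew_before = fare.skew()
skew_after = fare_log.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

If the transformed skewness is still far from zero, a different transform (or simply a median-based summary) may be a better fit.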
### 4.4.2 Bivariate Analysis
```python
# Scatter plot with regression line
# Survived is binary (0/1), so read the line as a rough linear trend in survival rate
sns.lmplot(x='Fare', y='Survived', data=df, height=6, aspect=1.2, ci=None)
plt.title('Fare vs. Survival')
plt.show()
```
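Use **pairplot** to compare several numeric variables at once:
```python
# pairplot only draws numeric columns, so a string column such as 'Sex' would be skipped
sns.pairplot(df[['Pclass', 'Age', 'Fare', 'Survived']], hue='Survived')
plt.show()
```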
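
When one variable is categorical and the other is binary, a grouped rate table is often easier to read than a scatter plot. The sketch below uses a small hand-made frame that borrows Titanic-style column names; the values are invented purely for illustration.

```python
import pandas as pd

# Small hand-made sample using Titanic-style column names (values are invented)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2, 2, 1],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Counts for two categorical variables at once
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

print(rate_by_sex)
print(counts)
```

The same `groupby` pattern works for any categorical column, which makes it a convenient companion to the plots in this section.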
### 4.4.3 Multivariate Analysis
```python
# Correlation heatmap (numeric columns only; pandas >= 2.0 raises on string columns otherwise)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
```
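
A heatmap gets hard to scan once there are many columns. One possible complement, sketched here on synthetic data with hypothetical column names `a`, `b`, and `c`, is to flatten the correlation matrix into a ranked list of variable pairs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic numeric features: "b" is built from "a", "c" is independent noise
a = rng.normal(size=n)
num = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=n),
    "c": rng.normal(size=n),
})

corr_matrix = num.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# Long format, strongest absolute correlation first
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them at the bottom.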
**Interactive dashboards:** Plotly Dash or Streamlit can wrap these plots in an interactive app for stakeholders.
---
## 4.5 Statistical Summaries
| Statistic | How to compute | When to use |
|-----------|----------------|--------------|
| Mean | `df.mean(numeric_only=True)` | Central tendency for symmetric data |
| Median | `df.median(numeric_only=True)` | Central tendency for skewed data; robust to outliers |
| Standard Deviation | `df.std(numeric_only=True)` | Dispersion around the mean |
| Skewness | `df.skew(numeric_only=True)` | Detect asymmetry |
| Kurtosis | `df.kurtosis(numeric_only=True)` | Heavy-tailedness |
| Correlation | `df.corr(method='pearson', numeric_only=True)` | Linear relationships |
| Chi-square | `scipy.stats.chi2_contingency(pd.crosstab(a, b))` | Association between two categorical variables |
Example: Computing descriptive stats for numerical columns:
```python
print(df.describe().T[['mean', 'std', 'min', '25%', '50%', '75%', 'max']])
```
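
The chi-square entry in the table needs a contingency table of counts rather than raw columns. A minimal sketch, using invented counts and Titanic-style column names as assumptions:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hand-made categorical sample (counts are invented for illustration)
toy = pd.DataFrame({
    "Sex": ["male"] * 50 + ["female"] * 50,
    "Survived": [0] * 40 + [1] * 10 + [0] * 15 + [1] * 35,
})

# chi2_contingency expects a table of observed counts
table = pd.crosstab(toy["Sex"], toy["Survived"])

# Null hypothesis: the two variables are independent
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

A small p-value suggests the two variables are associated; it says nothing about the direction or the cause of that association.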
---
## 4.6 Missingness & Outliers
### 4.6.1 Visualizing Missingness
```python
import missingno as msno
msno.matrix(df)
plt.title('Missingness Matrix')
plt.show()
```
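
A numeric summary is a useful companion to the matrix plot. The sketch below counts missing values per column and then runs a very rough check of whether missingness in one column varies with another, observed column. The frame is hand-made with invented values, and the check is a heuristic, not a formal test.

```python
import numpy as np
import pandas as pd

# Hand-made sample with gaps (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [22, np.nan, 35, np.nan, 54, 41, np.nan, 29],
    "Pclass": [1, 3, 1, 3, 2, 1, 3, 2],
    "Cabin": ["C85", None, "E46", None, None, "B28", None, None],
})

# Count and share of missing values per column, most affected first
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)
print(summary)

# Rough check: does the share of missing Age differ across an observed column?
age_missing_by_class = toy["Age"].isna().groupby(toy["Pclass"]).mean()
print(age_missing_by_class)
```

If the missing share differs sharply between groups, as it does in this toy sample, MCAR is unlikely and the imputation strategy should take the observed variable into account.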
### 4.6.2 Outlier Detection
| Method | Formula | Typical Threshold |
|--------|---------|-------------------|
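| IQR | `IQR = Q3 - Q1` | Below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR` |
| Z-score | `z = (x - μ) / σ` | `abs(z) > 3` |
| Robust (MAD) | `median ± k * MAD` | `k = 1.5` here; a larger `k` flags fewer points. Less sensitive to extreme values than μ and σ |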
Example using IQR:
```python
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['Fare'] < lower) | (df['Fare'] > upper)]
print('Number of outliers:', outliers.shape[0])
```
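
The other two methods in the table can behave quite differently on small samples. The sketch below contrasts the classic z-score with a MAD-based modified z-score. The 0.6745 scaling constant and the 3.5 cutoff are a common convention for the modified z-score, not something prescribed by this chapter, and the sample is invented.

```python
import pandas as pd

# Hand-made sample with one obvious extreme value (invented for illustration)
x = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 95], dtype=float)

# Classic z-score: the extreme point inflates both the mean and the std,
# so it partly hides itself
z = (x - x.mean()) / x.std()

# Modified z-score: median and MAD are barely affected by the extreme point.
# 0.6745 rescales MAD so it is comparable to the std for normal data.
median = x.median()
mad = (x - median).abs().median()
modified_z = 0.6745 * (x - median) / mad

z_outliers = x[z.abs() > 3]
mad_outliers = x[modified_z.abs() > 3.5]
print("z-score flags:", z_outliers.tolist())
print("MAD flags:", mad_outliers.tolist())
```

With only ten observations, the sample z-score cannot exceed 3 at all, so the value 95 goes unflagged, while the MAD-based score picks it up. This masking effect is one reason to prefer robust methods on small or heavily contaminated samples.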
---
## 4.7 Time‑Series Specific EDA
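1. **Decompose** the series into trend, seasonality, and residuals (statsmodels `seasonal_decompose`).
2. **Autocorrelation** (pandas `Series.autocorr`, statsmodels `plot_acf`).
3. **Stationarity** tests (e.g., the Augmented Dickey-Fuller test, statsmodels `adfuller`).
4. **Seasonal subseries plot** (e.g., statsmodels `month_plot` or `quarter_plot`).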
Example:
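```python
import statsmodels.api as sm
# Assumes df has a 'Date' column and a strictly positive 'Sales' column
# (the multiplicative model does not accept zero or negative values)
ts = df.set_index(pd.to_datetime(df['Date']))['Sales'].sort_index()
# period=12 assumes monthly data with a yearly cycle; adjust it to your data
decomp = sm.tsa.seasonal_decompose(ts, model='multiplicative', period=12)
decomp.plot()
plt.show()
```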
---
## 4.8 Reproducible EDA Workflow
| Step | Implementation | Notes |
|------|----------------|-------|
| 1 | Fix the random seed before any sampling, shuffling, or splitting | `np.random.seed(42)` |
| 2 | Version-control notebooks and data | Git for code and notebooks, Data Version Control (DVC) for datasets |
| 3 | Store plots in a dedicated folder | `figures/` |
| 4 | Automate summary stats | `df.describe().to_csv('summary.csv')` |
| 5 | Use `ipywidgets` for interactive filters | Great for demos |
| 6 | Document assumptions | Inline Markdown cells |
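
Steps 1, 3, and 4 of the table can be combined into one short script. This is a minimal sketch on synthetic data: the `value` column and the file name `value_hist.png` are made up for the example, and it uses NumPy's seeded `default_rng` generator as an alternative to the global `np.random.seed`.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

SEED = 42
rng = np.random.default_rng(SEED)  # seeded generator for any sampling or shuffling

# Synthetic stand-in for a real dataset (column name is illustrative)
data = pd.DataFrame({"value": rng.normal(loc=50, scale=10, size=200)})

# Step 3: keep plots in a dedicated folder
fig_dir = Path("figures")
fig_dir.mkdir(exist_ok=True)

fig, ax = plt.subplots()
ax.hist(data["value"], bins=20)
ax.set_title("Value distribution")
fig_path = fig_dir / "value_hist.png"
fig.savefig(fig_path, dpi=150, bbox_inches="tight")
plt.close(fig)

# Step 4: persist the summary statistics
summary_path = Path("summary.csv")
data.describe().to_csv(summary_path)
```

Saving figures and summaries to disk, rather than leaving them only in notebook output, makes it easier to diff results between runs.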
---
## 4.9 Common Pitfalls & How to Avoid Them
| Pitfall | Why it’s problematic | Remedy |
|---------|--------------------|--------|
| **Correlation ≠ Causation** | Mistaking a spurious relationship for a causal link | Check domain knowledge, control variables |
| **Data Leakage during EDA** | Using future information (e.g., mean of entire dataset) before training | Split data first, compute stats on training set |
| **Over‑fitting to Visuals** | Tailoring features to specific plots, ignoring broader context | Validate patterns with statistical tests |
| **Ignoring Context** | Treating numbers as isolated | Keep business problem at the core |
| **Neglecting Missingness Mechanism** | Imputing blindly without understanding MCAR/MAR/MNAR | Test for MCAR (e.g., Little's test), compare rows with and without missing values, report assumptions |
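
The data-leakage remedy in the table (split first, then compute statistics on the training set) can be sketched as follows. The data is synthetic, the `age` column is a hypothetical example, and mean imputation is used only because it is the simplest statistic to demonstrate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic data with roughly 20% missing values (column name is illustrative)
data = pd.DataFrame({"age": rng.normal(40, 12, size=200)})
data.loc[rng.random(200) < 0.2, "age"] = np.nan

# Split first: 80% train, the remaining rows are the test set
train = data.sample(frac=0.8, random_state=42)
test = data.drop(train.index)

# Learn the imputation value from the training rows only
train_mean = train["age"].mean()

# Apply that same value to both splits
train_filled = train["age"].fillna(train_mean)
test_filled = test["age"].fillna(train_mean)

print(f"train mean used for imputation: {train_mean:.2f}")
```

Computing the mean on the full frame before splitting would let information from the test rows influence the training data, which is exactly the leakage the table warns about.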
---
## 4.10 Summary & Key Takeaways
1. **EDA is exploratory, not prescriptive.** Use it to guide decisions, not to decide everything.
2. **Visuals + stats** – Combine plots with descriptive statistics for robust insights.
3. **Document everything** – Reproducibility is the backbone of a trustworthy data science project.
4. **Respect data types** – Tailor techniques to numerical, categorical, datetime, or text data.
5. **Keep an eye on ethics** – Visual misrepresentations can mislead stakeholders and propagate bias.
6. **Iterate** – EDA is a loop: refine plots, discover new variables, revisit hypotheses.
> *“A well‑crafted EDA is the compass that points data scientists toward the most promising directions in the data‑land.”* – 墨羽行
---
## 4.11 Further Reading & Resources
| Resource | Focus |
|----------|-------|
| *Python Data Science Handbook* – Jake VanderPlas | Practical code examples |
| *ggplot2: Elegant Graphics for Data Analysis* – Hadley Wickham | Theoretical foundations for visual encoding |
| *An Introduction to Statistical Learning* – Gareth James et al. | Correlation, hypothesis testing |
| Plotly Docs | Interactive plot features |
| Missingno Docs | Visualizing missing data |
| Statsmodels API | Time‑series decomposition |
---
## 4.12 Hands‑On Exercise
1. Download the **Adult Income** dataset from UCI.
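2. Perform the following EDA steps:
   - Univariate plots for each numeric feature.
   - Correlation heatmap; identify the numeric feature most strongly correlated with income (encode the income label as 0/1 first).
   - Visualize missingness patterns (missing values are coded as `?` in this dataset).
   - Detect outliers in *Hours-per-week*.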
3. Compile a 2‑page report with plots, tables, and a brief narrative.
4. Share your Jupyter notebook on GitHub, including a `requirements.txt`.
Good luck, and remember: *The goal of EDA is to ask the right questions, not just to show charts.*