Data Science for Decision Makers: Turning Numbers into Insight - Chapter 3
Published 2026-02-24 12:29
# Chapter 3: Exploratory Data Analysis (EDA)
> **Key Insight:** *Exploratory Data Analysis is the laboratory where raw data is turned into hypotheses.*
---
## 1. What is EDA?
Exploratory Data Analysis (EDA) is a systematic, iterative process of probing a dataset to uncover its underlying structure, detect anomalies, and formulate preliminary hypotheses. Unlike confirmatory statistics, which tests predefined hypotheses, EDA is open-ended: you explore first and let the data suggest which questions to ask.
| EDA Element | Purpose | Typical Techniques |
|-------------|---------|---------------------|
| Descriptive statistics | Summarize the central tendency and dispersion | mean, median, mode, std, IQR |
| Data visualization | Reveal patterns, relationships, and outliers | histograms, boxplots, scatter plots, heatmaps |
| Data quality checks | Identify missingness, duplicates, and inconsistencies | missing‑value heatmap, duplicate rows count |
| Feature engineering | Transform raw variables into analytic features | scaling, encoding, interaction terms |
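
As a minimal sketch of the descriptive-statistics row in the table above, the snippet below computes each listed measure with pandas. The values are invented for illustration and do not come from the chapter's dataset.

```python
import pandas as pd

# Hypothetical daily sales figures, invented for illustration only
values = pd.Series([120, 135, 150, 150, 170, 210, 480])

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().iloc[0],   # most frequent value
    "std": values.std(),             # sample standard deviation (ddof=1)
    "iqr": values.quantile(0.75) - values.quantile(0.25),
}
print(summary)
```

Here the mean sits well above the median because one large value pulls it upward, which is an early hint of right skew.
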
---
## 2. The EDA Workflow
1. **Define the Problem Context** – Clarify business questions.
2. **Load and Inspect the Data** – Quick look at dimensions, dtypes, and sample rows.
3. **Summarize Univariate Distributions** – Compute statistics, plot histograms.
4. **Explore Bivariate Relationships** – Scatter, bar, or violin plots.
5. **Detect and Treat Anomalies** – Outliers, missing values.
6. **Generate Feature Insights** – Correlation matrices, mutual information.
7. **Document Findings** – Narrative, markdown, or notebooks.
> *Tip:* Keep the workflow modular – write reusable functions or use Jupyter notebooks for transparency.
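
One way to keep the workflow modular is to wrap the inspection step in a small helper. The sketch below is only an illustration: `quick_overview` is a hypothetical name, and the demo frame is made up.

```python
import pandas as pd

def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),   # NaN is not counted as a distinct value
    })

# Tiny made-up frame to show the helper in use
demo = pd.DataFrame({
    "category": ["Food", "Food", "Electronics"],
    "price": [5.0, None, 199.0],
})
overview = quick_overview(demo)
print(overview)
```

Because the helper takes any DataFrame, it can be reused each time new data arrives.
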
---
## 3. Hands‑On Example: Retail Sales Dataset
We’ll walk through a real‑world dataset, **`sales.csv`**, which contains daily sales for 12 product categories over 2 years.
```python
# 1️⃣ Load the data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sales = pd.read_csv('sales.csv', parse_dates=['date'])
print(sales.shape)  # (731, 14)
```
### 3.1 Data Inspection
```python
sales.head()
```
| date | product_id | category | store_id | sales | price | units | promo | ... |
|------------|------------|----------|----------|-------|-------|-------|-------|-----|
| 2021‑01‑01 | 101 | Food | 1 | 250 | 5.00 | 50 | 0 | ... |
- **Nulls**
```python
sales.isna().sum()
```
```text
sales 0
price 0
units 0
promo 0
...
```
- **Duplicates**
```python
sales.duplicated().sum()
```
```text
0
```
### 3.2 Univariate Analysis
#### 3.2.1 Summary Statistics
```python
sales.describe(include='all')
```
Key insights:
- Mean sales per day: **$2,500**
- Standard deviation: **$1,200**
- 75th percentile of units sold: **75**
#### 3.2.2 Histograms & Density Plots
```python
plt.figure(figsize=(12, 4))
sns.histplot(sales['sales'], kde=True, bins=30)
plt.title('Distribution of Daily Sales')
plt.xlabel('Sales ($)')
plt.ylabel('Frequency')
plt.show()
```
The histogram reveals a **right‑skewed** distribution – typical for revenue data.
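
The skew seen in a histogram can also be put into a number. The sketch below uses a synthetic log-normal series as a stand-in for `sales['sales']` (an assumption, since the real file is not reproduced here) and shows how a log transform compresses the right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed revenue (log-normal draw); stands in for sales['sales']
rng = np.random.default_rng(seed=0)
revenue = pd.Series(rng.lognormal(mean=7.5, sigma=0.6, size=1000))

raw_skew = revenue.skew()             # > 0 means a longer right tail
log_skew = np.log1p(revenue).skew()   # log transform compresses the tail

print(f"skew (raw):   {raw_skew:.2f}")
print(f"skew (log1p): {log_skew:.2f}")
```

A skewness near zero after the transform suggests that models assuming roughly symmetric inputs may behave better on the log scale.
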
### 3.3 Bivariate Analysis
#### 3.3.1 Price vs. Units Sold
```python
plt.figure(figsize=(6, 6))
sns.scatterplot(data=sales, x='price', y='units', hue='category')
plt.title('Price vs. Units Sold by Category')
plt.show()
```
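Observation: *Electronics* shows a weak negative correlation between price and units sold; *Food* displays a strong positive correlation. A scatter plot only suggests these relationships, so confirm them with per‑category correlation coefficients before drawing conclusions.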
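
To check what the scatter plot suggests, the price/units correlation can be computed within each category. This is a minimal sketch on made-up rows. The column names mirror the chapter's table, but the numbers are invented to show the mechanics.

```python
import pandas as pd

# Made-up rows; column names mirror the chapter's sales table
df = pd.DataFrame({
    "category": ["Food"] * 4 + ["Electronics"] * 4,
    "price": [4.0, 5.0, 6.0, 7.0, 200.0, 250.0, 300.0, 350.0],
    "units": [40, 48, 61, 70, 12, 11, 9, 10],
})

# Pearson correlation between price and units within each category
corr_by_category = {
    name: group["price"].corr(group["units"])
    for name, group in df.groupby("category")
}
print(corr_by_category)
```

Computing the coefficient per group matters because a pooled correlation across categories with very different price levels can hide or even reverse the within-category pattern.
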
#### 3.3.2 Sales Over Time
```python
sales.set_index('date')['sales'].plot(figsize=(14, 4))
plt.title('Daily Sales Trend')
plt.ylabel('Sales ($)')
plt.show()
```
Sales peak in December, a seasonal pattern that suggests holiday promotion opportunities.
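
A seasonal peak is easier to verify once daily values are averaged by calendar month. The sketch below builds a synthetic two-year series with a deliberate December bump (an assumption made purely for illustration) and finds the peak month.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a built-in December bump (illustration only)
dates = pd.date_range("2021-01-01", "2022-12-31", freq="D")
rng = np.random.default_rng(seed=1)
daily = pd.Series(2500 + rng.normal(0, 200, len(dates)), index=dates)
daily.loc[daily.index.month == 12] += 1000

# Average sales by calendar month across both years
monthly_mean = daily.groupby(daily.index.month).mean()
peak_month = int(monthly_mean.idxmax())

print(monthly_mean.round(0))
print("Peak month:", peak_month)
```

With only two years of data each month's average rests on two observations of that month, so treat any apparent seasonality as a hypothesis to revisit as more data arrives.
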
### 3.4 Outlier Detection & Treatment
```python
# Boxplot for sales
sns.boxplot(x=sales['sales'])
plt.title('Boxplot of Sales')
plt.show()
```
Outliers: sales > $6,000 on a few days. Options:
- **Winsorize** (capping at 95th percentile)
- **Remove** if justified (e.g., data entry error)
```python
# Cap the upper tail at the 95th percentile; the lower tail is left unchanged
sales['sales_cap'] = sales['sales'].clip(upper=sales['sales'].quantile(0.95))
```
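
Before capping or removing anything, it helps to list which rows count as outliers. A common convention, and the default whisker rule in seaborn's boxplot, is the 1.5 × IQR fence. The sketch below applies it to a made-up series.

```python
import pandas as pd

# Hypothetical daily sales with two unusually large days
s = pd.Series([2100, 2300, 2400, 2500, 2550, 2600, 2800, 6400, 7100])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag values outside the fences instead of dropping them right away
is_outlier = (s < lower_fence) | (s > upper_fence)
print(s[is_outlier].tolist())  # [6400, 7100]
```

Flagging first keeps the decision reversible: the flagged days can be checked against promotions or data-entry logs before any treatment is applied.
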
### 3.5 Correlation & Heatmaps
```python
numeric_cols = ['sales', 'price', 'units', 'promo']
corr = sales[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
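The positive correlation between *promo* and *sales* is consistent with promotions lifting revenue, but correlation alone does not establish causation: promotions may simply be scheduled in periods that already sell well, such as December.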
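
One quick way to probe a promo/sales relationship is to compare the promo gap overall with the gap inside each month. The rows below are invented so that promotions cluster in December. The sketch shows the mechanics and is not a result from the chapter's data.

```python
import pandas as pd

# Invented rows: promo days cluster in December, a possible confounder
df = pd.DataFrame({
    "month": [6, 6, 6, 6, 12, 12, 12, 12],
    "promo": [0, 0, 0, 1, 0, 1, 1, 1],
    "sales": [2000, 2100, 1900, 2050, 3900, 4000, 4100, 3950],
})

# Naive comparison: promo vs. non-promo days, ignoring the month
overall = df.groupby("promo")["sales"].mean()
overall_gap = overall.loc[1] - overall.loc[0]

# Same comparison within each month
within_month = df.groupby(["month", "promo"])["sales"].mean().unstack("promo")
within_gap = within_month[1] - within_month[0]

print("Overall promo gap:", overall_gap)
print(within_gap.round(1))
```

In this toy example the overall gap is far larger than the gap inside either month, which is the signature of a seasonal confounder inflating the apparent promotion effect.
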
### 3.6 Feature Engineering Ideas
| Feature | Rationale |
|---------|-----------|
| **sales_per_unit** | `sales / units`: average realized price per unit, a starting point for price‑sensitivity analysis |
| **month** | Extract month for seasonal analysis |
| **is_holiday** | Flag holidays to study promotion impact |
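```python
sales['month'] = sales['date'].dt.month
# Mask zero-unit rows so the ratio becomes NaN instead of inf
units_nonzero = sales['units'].where(sales['units'] > 0)
sales['sales_per_unit'] = sales['sales'] / units_nonzero
```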
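
The `is_holiday` feature from the table can be sketched with a simple membership test. The holiday list below is hypothetical; a real project would use the business's own calendar.

```python
import pandas as pd

# Hypothetical holiday list, for illustration only
holidays = pd.to_datetime(["2021-01-01", "2021-12-25"])

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-12-25"]),
})

# 1 if the date appears in the holiday list, else 0
df["is_holiday"] = df["date"].isin(holidays).astype(int)
print(df)
```
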
---
## 4. Best Practices for EDA
| Practice | Why it Matters |
|----------|----------------|
| **Use Reproducible Scripts** | Enables audit trails and collaboration |
| **Document Every Decision** | Clarifies assumptions when building models |
| **Iterate Quickly** | EDA is exploratory; rapid iterations reveal patterns faster |
| **Visualize with Context** | Include titles, axis labels, and legends to avoid misinterpretation |
| **Check Data Types Early** | Prevents errors in calculations (e.g., numeric fields stored as strings) |
| **Integrate Domain Knowledge** | Guides which relationships to probe |
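
As an illustration of checking data types early, the sketch below converts a made-up price column that was read in as text. The sample values, including `"n/a"`, are assumptions.

```python
import pandas as pd

# Made-up column where prices were read as strings
raw = pd.DataFrame({"price": ["5.00", "12.50", "n/a", "7.25"]})

# errors="coerce" turns unparseable entries into NaN instead of raising
raw["price_num"] = pd.to_numeric(raw["price"], errors="coerce")

n_failed = int(raw["price_num"].isna().sum())
print(raw.dtypes)
print(n_failed, "value(s) could not be parsed")
```

Counting the coerced values right away makes silent data loss visible, so the decision about how to handle them can be documented.
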
---
## 5. Common Pitfalls and How to Avoid Them
| Pitfall | Consequence | Mitigation |
|----------|-------------|------------|
| **Over‑fitting to Visual Noise** | Misguided hypotheses | Use statistical tests to confirm visual patterns |
| **Ignoring Missing Data** | Biased estimates | Apply imputation or flag missingness |
| **Skipping Data Validation** | Garbage in, garbage out | Cross‑check against source systems |
| **Relying on a Single Metric** | Oversimplification | Combine multiple summary statistics |
| **Neglecting Time‑Series Properties** | Misinterpretation of trends | Plot with time axis, check for autocorrelation |
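
For the time-series pitfall, a quick autocorrelation check can be done with pandas alone. The sketch below uses a synthetic series with a weekly cycle (an assumption for illustration) and compares lag 1 with lag 7.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a weekly cycle (illustration only)
days = np.arange(140)
rng = np.random.default_rng(seed=2)
series = pd.Series(
    100 + 20 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 3, days.size)
)

lag1 = series.autocorr(lag=1)
lag7 = series.autocorr(lag=7)   # a weekly pattern shows up as a high lag-7 value

print(f"lag 1: {lag1:.2f}, lag 7: {lag7:.2f}")
```

Strong autocorrelation means consecutive observations are not independent, which matters for any later test or model that assumes they are.
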
---
## 6. Tools & Libraries
| Library | Strengths |
|---------|-----------|
| **pandas** | Data manipulation & aggregation |
| **NumPy** | Fast numerical operations |
| **Matplotlib** | Customizable plots |
| **Seaborn** | Statistical visualizations & aesthetics |
| **Plotly** | Interactive plots for dashboards |
| **Sweetviz / ydata‑profiling** (formerly pandas‑profiling) | Auto‑generated EDA reports |
| **Great Expectations** | Data validation & quality checks |
---
## 7. Take‑Away Checklist
- [ ] Verify data types and missing values.
- [ ] Compute univariate statistics and visualize distributions.
- [ ] Explore pairwise relationships with scatter plots or correlation matrices.
- [ ] Identify outliers and decide on treatment.
- [ ] Generate new features aligned with business questions.
- [ ] Document insights in a narrative report.
- [ ] Review findings with domain experts before modeling.
---
## 8. Real‑World Case Study: Customer Retention for a SaaS Platform
**Context**: A SaaS company wants to predict churn. The dataset contains user activity logs, subscription details, and support tickets.
1. **Univariate**: High mean days‑active per month; skewness indicates a small group of highly active users.
2. **Bivariate**: Strong negative correlation between support tickets and retention.
3. **Time‑Series**: Notice a drop in activity during the first 30 days post‑signup.
4. **Outlier Handling**: Exclude users with anomalously high ticket counts due to system bugs.
5. **Feature Engineering**: Create `avg_session_duration`, `support_ticket_count`, and `days_since_last_login`.
6. **Result**: EDA guided the choice of a tree‑based model and highlighted the importance of early engagement metrics.
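
The three engineered features from step 5 could be derived from an activity log roughly as follows. This is a sketch under stated assumptions: the log layout, its column names, and the snapshot date `as_of` are all hypothetical, since the case study does not show its schema.

```python
import pandas as pd

# Hypothetical activity log; column names are assumptions for illustration
log = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "login_at": pd.to_datetime(
        ["2025-01-02", "2025-01-20", "2025-01-05", "2025-01-06", "2025-01-28"]
    ),
    "session_minutes": [10, 30, 5, 15, 10],
    "tickets_opened": [0, 1, 2, 0, 1],
})
as_of = pd.Timestamp("2025-02-01")  # assumed snapshot date

# One row per user with the aggregated features
features = log.groupby("user_id").agg(
    avg_session_duration=("session_minutes", "mean"),
    support_ticket_count=("tickets_opened", "sum"),
    last_login=("login_at", "max"),
)
features["days_since_last_login"] = (as_of - features["last_login"]).dt.days
features = features.drop(columns="last_login")
print(features)
```

Fixing an explicit snapshot date keeps `days_since_last_login` reproducible and avoids leaking information from after the prediction point.
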
---
## 9. Next Steps
With a solid EDA foundation, you’re ready to move into **Statistical Inference** (Chapter 4) and **Predictive Modeling** (Chapter 5). The insights you’ve captured will shape feature selection, model choice, and evaluation strategies.
---
> *Remember:* *EDA is not a one‑off step; revisit it whenever you acquire new data or pivot the business question.*