Data Science Unlocked: A Practical Guide for Modern Analysts - Chapter 4
Chapter 4: Exploratory Data Analysis (EDA)
Published 2026-02-23 16:15
# Chapter 4: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the **first analytical step** that turns raw data into insights. It combines statistical summaries, visualizations, and interactive tools to uncover patterns, spot anomalies, test assumptions, and formulate hypotheses for downstream modeling.
---
## 4.1 Why EDA Matters
| Benefit | Description |
|---|---|
| **Rapid Insight** | Discover trends before committing to heavy modeling. |
| **Data Quality Check** | Spot missing values, outliers, and inconsistencies early. |
| **Feature Engineering** | Identify promising variables and relationships. |
| **Model Bias Prevention** | Detect skewed distributions that can bias models. |
| **Storytelling** | Build visual narratives that communicate findings to stakeholders. |
EDA bridges the gap between data ingestion and predictive analytics, ensuring that the data you feed into models is **clean, representative, and understood**.
---
## 4.2 Foundations: Statistical Summaries
Statistical descriptors are the backbone of any EDA session. They provide a quick snapshot of central tendency, dispersion, and shape.
```python
import pandas as pd

# Load sample dataset
df = pd.read_csv('data/sample.csv')

# Summary statistics for numeric columns
print(df.describe(include='number'))

# Count of unique values for non-numeric (categorical) columns
print(df.select_dtypes(exclude='number').nunique())
```
Key metrics:
- **Mean, Median, Mode** – center
- **Std, IQR, Min/Max** – spread
- **Skew, Kurtosis** – distribution shape
- **Correlation matrix** – linear relationships
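
Each of these metrics maps to a one-line pandas call. The following is a minimal sketch on a small made-up `spend` series (the values are illustrative and not taken from the sample dataset); it shows how a single large value pulls the mean away from the median and produces positive skew.

```python
import pandas as pd

# Hypothetical values for illustration; one large value skews the distribution
spend = pd.Series([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 90.0], name='spend')

q1, q3 = spend.quantile([0.25, 0.75])
summary = {
    'mean': spend.mean(),
    'median': spend.median(),
    'mode': spend.mode().iloc[0],
    'std': spend.std(),        # sample standard deviation (ddof=1)
    'iqr': q3 - q1,
    'skew': spend.skew(),      # > 0 indicates a long right tail
    'kurtosis': spend.kurt(),  # excess kurtosis (0 for a normal distribution)
}
for name, value in summary.items():
    print(f'{name:>8}: {value:.2f}')
```

When the mean sits well above the median, as it does here, the median and IQR are usually the safer descriptors of center and spread.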
### Visual Complement: Box Plots & Histograms
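```python
import seaborn as sns
import matplotlib.pyplot as plt

numeric = df.select_dtypes(include='number')

# Box plots: one horizontal box per numeric feature
sns.boxplot(data=numeric, orient='h')
plt.title('Box Plot of Numeric Features')
plt.show()

# Histograms: one panel per numeric feature
numeric.hist(bins=30, figsize=(10, 6))
plt.tight_layout()
plt.show()
```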
---
## 4.3 Univariate Analysis
### 4.3.1 Numeric Variables
| Plot | When to Use | Tool(s) |
|---|---|---|
| Histogram | Distribution shape | `matplotlib`, `seaborn`, `plotly` |
| Density Plot | Smooth distribution | `seaborn.kdeplot`, `plotly.figure_factory.create_distplot` |
| Box Plot | Outlier detection | `seaborn.boxplot`, `plotly.express.box` |
```python
# Histogram with KDE
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()
```
### 4.3.2 Categorical Variables
| Plot | Purpose | Tool(s) |
|---|---|---|
| Bar Chart | Frequency counts | `seaborn.countplot`, `plotly.express.bar` |
| Pie Chart | Proportion view | `matplotlib.pyplot.pie` |
| Heatmap (count matrix) | Cross‑tab of two categoricals | `seaborn.heatmap` |
```python
# Bar chart for a categorical variable
sns.countplot(x='country', data=df)
plt.title('Customer Distribution by Country')
plt.show()
```
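
The numbers behind these categorical plots can be inspected directly. A minimal sketch, assuming a tiny made-up frame `df_demo` with `country` and `segment` columns: `value_counts` gives the frequencies a bar or pie chart would show, and `pd.crosstab` builds the count matrix that could be passed to `seaborn.heatmap`.

```python
import pandas as pd

# Hypothetical categorical columns for illustration
df_demo = pd.DataFrame({
    'country': ['US', 'US', 'DE', 'DE', 'DE', 'FR'],
    'segment': ['A', 'B', 'A', 'A', 'B', 'B'],
})

# Absolute frequencies per category (bar chart input)
counts = df_demo['country'].value_counts()
# Relative frequencies; the proportions sum to 1 (pie chart input)
shares = df_demo['country'].value_counts(normalize=True)
# Count matrix of two categoricals (heatmap input)
ct = pd.crosstab(df_demo['country'], df_demo['segment'])

print(counts, shares.round(2), ct, sep='\n\n')
```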
---
## 4.4 Bivariate & Multivariate Analysis
Understanding pairwise relationships sets the stage for feature engineering.
### 4.4.1 Scatter Plots & Pair Plots
```python
# Pair plot for selected features; the hue column must be part of the data passed in
sns.pairplot(df[['age', 'income', 'spend', 'segment']], hue='segment')
plt.show()
```
### 4.4.2 Correlation Heatmaps
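```python
# Restrict to numeric columns; recent pandas versions raise on non-numeric data otherwise
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
```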
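
A heatmap gets hard to read as the number of features grows, so a ranked list of pairs can be a compact complement. The sketch below uses synthetic data (the `demo` frame and all of its values are made up for illustration) and lists each pair of numeric features once, strongest absolute correlation first.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n = 200
income = rng.normal(50_000, 10_000, n)
demo = pd.DataFrame({
    'income': income,
    'spend': 0.1 * income + rng.normal(0, 200, n),  # constructed to track income
    'age': rng.integers(18, 70, n),                  # generated independently
    'segment': rng.choice(['A', 'B'], n),            # non-numeric, excluded below
})

corr = demo.corr(numeric_only=True)

# List each unordered pair once, strongest absolute correlation first
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values(key=abs, ascending=False)
print(pairs.round(2))
```

Pearson correlation only captures linear association, so a low value in this list does not rule out a nonlinear relationship.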
### 4.4.3 Categorical‑Numeric Relationships
```python
sns.boxplot(x='segment', y='income', data=df)
plt.title('Income by Customer Segment')
plt.show()
```
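
A grouped summary table puts numbers on what the box plot shows. A minimal sketch, assuming a small made-up frame with `segment` and `income` columns:

```python
import pandas as pd

# Hypothetical incomes for two customer segments
demo = pd.DataFrame({
    'segment': ['A', 'A', 'A', 'B', 'B', 'B'],
    'income': [40_000, 45_000, 50_000, 70_000, 80_000, 120_000],
})

# One row per segment: size, center, and spread of income
by_segment = demo.groupby('segment')['income'].agg(['count', 'median', 'mean', 'std'])
print(by_segment)
```

In segment B the mean exceeds the median, which hints at the same right tail that a box plot would show as a long upper whisker.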
---
## 4.5 Detecting Anomalies & Outliers
| Technique | How it Works | Typical Use‑Case |
|---|---|---|
| IQR Method | Flags values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR | Detect extreme purchases |
| Z‑Score | Standard deviations from mean | Identify abnormal temperatures |
| Isolation Forest | Random partitioning trees | Fraud detection |
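```python
from scipy import stats

# Z-score outlier detection; nan_policy='omit' keeps missing values
# from turning every score into NaN
df['zscore'] = stats.zscore(df['spend'], nan_policy='omit')
outliers = df[df['zscore'].abs() > 3]
print(outliers.head())
```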
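
The IQR method from the table can be sketched in a few lines. The purchase amounts below are made up for illustration; the 1.5 multiplier is the conventional choice named in the table.

```python
import pandas as pd

# Hypothetical purchase amounts; 250.0 is an obvious extreme value
spend = pd.Series([20.0, 22.0, 19.0, 24.0, 21.0, 23.0, 250.0, 18.0])

q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
iqr = q3 - q1
# Fences: anything outside [lower, upper] is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (spend < lower) | (spend > upper)
print(f'fences: [{lower:.2f}, {upper:.2f}]')
print(spend[is_outlier])
```

Quartiles are barely moved by the extreme value itself, whereas the mean and standard deviation used by the z-score are inflated by it. On small or heavily skewed samples the IQR rule is therefore often the more reliable first check.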
---
## 4.6 Feature Importance & Dimensionality
While EDA focuses on data *before* modeling, it can hint at feature relevance:
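- **Correlation thresholds**: Consider dropping features whose correlation with the target is weak (for example, |r| < 0.1). This is a linear screen only, so check for nonlinear relationships before discarding anything.
- **Variance threshold**: Remove near‑constant variables, which carry almost no information.
- **Multicollinearity**: A Variance Inflation Factor (VIF) above roughly 5 is a common rule of thumb for redundancy among predictors.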
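```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Numeric predictors only; VIF cannot handle missing values
X = df.select_dtypes(include='number').dropna()
# Add an intercept so each auxiliary regression includes a constant term
X = sm.add_constant(X)
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
# The intercept row is not a feature, so leave it out of the ranking
print(vif[vif['feature'] != 'const'].sort_values('VIF', ascending=False))
```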
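
The variance and correlation screens from the list above can be sketched with plain pandas. Everything in this example is an assumption for illustration: the `demo` frame is deterministic synthetic data, the column names are hypothetical, and the variance cutoff of 0.01 is arbitrary (only the |r| < 0.1 cutoff comes from the text).

```python
import numpy as np
import pandas as pd

# Deterministic synthetic data so the outcome is reproducible
n = 200
k = np.arange(n)
x1 = k - k.mean()                          # linear ramp, drives the target
flip = np.where(k % 2 == 0, 1.0, -1.0)     # alternating sign, unrelated to the target
almost_const = np.zeros(n)
almost_const[n // 2] = 1.0                 # a single non-zero entry

demo = pd.DataFrame({
    'x1': x1,
    'flip': flip,
    'almost_const': almost_const,
    'target': 3.0 * x1 + 10.0 * np.sin(k),
})
features = demo.drop(columns='target')

# Variance threshold: flag near-constant columns (0.01 is an assumed cutoff)
low_variance = features.columns[features.var() < 0.01].tolist()

# Correlation threshold: flag features with |r| < 0.1 against the target
r = features.corrwith(demo['target'])
weak = r.index[r.abs() < 0.1].tolist()

print('near-constant:', low_variance)
print('weakly correlated with target:', weak)
```

Both screens are heuristics. Variance depends on the scale of each column, so comparing variances across features is only meaningful after putting them on a comparable scale.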
---
## 4.7 Automating EDA Workflows
For large, recurring datasets, manual EDA becomes impractical. Automate with:
- **ydata‑profiling** (formerly pandas‑profiling) or **Sweetviz** for auto‑generated reports
- **Tabulate** for CLI-friendly tables
- **Plotly Dash** or **Streamlit** for interactive dashboards
```python
# Quick profile report (pandas-profiling was renamed to ydata-profiling)
from ydata_profiling import ProfileReport

report = ProfileReport(df, title='EDA Report')
report.to_file('report.html')
```
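
When a full HTML report is more than a recurring job needs, a small helper that returns a table can be scheduled or logged instead. A minimal sketch, assuming a hypothetical helper named `column_overview` and a made-up input frame:

```python
import numpy as np
import pandas as pd

def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, share of missing values, cardinality."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing_pct': df.isna().mean().mul(100).round(1),
        'n_unique': df.nunique(),  # missing values are not counted
    }).sort_values('missing_pct', ascending=False)

# Hypothetical data for illustration
demo = pd.DataFrame({
    'age': [34, 51, np.nan, 29],
    'country': ['US', 'DE', 'DE', None],
    'spend': [120.5, 80.0, 95.2, 60.1],
})
overview = column_overview(demo)
print(overview)
```

Because the result is an ordinary DataFrame, successive runs can be compared to catch drift in missingness or cardinality.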
---
## 4.8 Interactive Dashboards
Stakeholders often prefer interactive exploration over static plots. Two popular frameworks:
| Framework | Strengths | Typical Stack |
|---|---|---|
| **Plotly Dash** | Production‑ready, supports callbacks | Python, Flask, Docker |
| **Streamlit** | Rapid prototyping, simple API | Python, pip, `streamlit run app.py` |
### Sample Streamlit Dashboard
```python
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt

st.title('Customer Spend Dashboard')
df = pd.read_csv('data/sample.csv')

# Sidebar filter: all segments selected by default
segments = sorted(df['segment'].dropna().unique())
selected = st.sidebar.multiselect('Segment', segments, default=segments)
filtered = df[df['segment'].isin(selected)]

# Histogram of spend for the selected segments
fig, ax = plt.subplots()
ax.hist(filtered['spend'].dropna(), bins=30, color='steelblue')
ax.set_title('Spend Distribution')
st.pyplot(fig)
```
---
## 4.9 Best Practices Checklist
| Practice | Why It Matters |
|---|---|
| **Document assumptions** | Ensures reproducibility |
| **Keep a versioned data notebook** | Track changes over time |
| **Use consistent color palettes** | Enhances readability |
| **Validate with domain experts** | Aligns findings with business context |
| **Automate routine plots** | Saves time and reduces errors |
| **Publish dashboards to a secure gateway** | Protects sensitive data |
---
## 4.10 Case Study: Retail Sales Forecasting
**Scenario**: A mid‑size retailer wants to understand seasonal sales patterns and customer segmentation before building a forecasting model.
1. **Load data** – Sales transactions, customer profiles, and product metadata.
2. **Univariate plots** – Histogram of daily sales, KDE of customer tenure.
3. **Time‑series decomposition** – Seasonal component via `statsmodels.tsa.seasonal.seasonal_decompose`.
4. **Correlation heatmap** – Identify which product attributes drive sales.
5. **Cluster customers** – K‑means on purchasing behavior (to be used later in segmentation).
6. **Dashboard** – Interactive timeline, heatmap, and cluster summary using Dash.
7. **Insights** – Peak sales in Q4, high‑value customers cluster with premium products.
These insights directly informed the feature set for the final forecasting model and guided marketing strategy.
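
To make the decomposition step concrete, here is a pandas‑only sketch of a classical additive decomposition. It is a simplified stand‑in for `seasonal_decompose`, not its exact algorithm, and the monthly sales series is synthetic: a made‑up linear trend plus a made‑up Q4 uplift.

```python
import numpy as np
import pandas as pd

# Synthetic monthly sales: linear trend plus a Q4 uplift (made-up numbers)
idx = pd.date_range('2021-01-01', periods=36, freq='MS')
trend = np.linspace(100, 135, 36)
seasonal = np.where(idx.month >= 10, 30.0, 0.0)
sales = pd.Series(trend + seasonal, index=idx, name='sales')

# Step 1: estimate the trend with a centered 12-month moving average
est_trend = sales.rolling(window=12, center=True).mean()

# Step 2: average the detrended values per calendar month
detrended = (sales - est_trend).dropna()
month_effect = detrended.groupby(detrended.index.month).mean()
month_effect -= month_effect.mean()  # center so the effects sum to zero

peak_month = int(month_effect.idxmax())
print(month_effect.round(1))
print('peak month:', peak_month)
```

On this synthetic series the estimated effects for October through December stand clearly above the other months, which is the kind of Q4 peak described in the insights step.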
---
## 4.11 Key Takeaways
- **EDA is exploratory, not prescriptive** – It informs decisions; it does not make them.
- **Visualization is power** – Good plots can uncover hidden patterns faster than tables.
- **Interactivity accelerates insight** – Dashboards allow stakeholders to drill down and validate findings.
- **Automation safeguards repeatability** – Use libraries like `ydata-profiling` (formerly `pandas-profiling`) and notebooks to lock in EDA steps.
- **Domain knowledge amplifies EDA** – Combine statistical signals with business context for actionable recommendations.
---
**Next Step**: With a solid EDA foundation, we move to Chapter 5 to harness supervised learning techniques and refine predictive models based on the insights uncovered.