Data Science for the Analytical Mind: From Raw Data to Insightful Decisions - Chapter 4
Published 2026-03-03 15:05
# Chapter 4: Exploratory Data Analysis & Visualization
Exploratory Data Analysis (EDA) is the compass that guides data scientists from raw data to meaningful insights. In this chapter we
- **Define** the core concepts and why EDA matters.
- **Walk through** step‑by‑step techniques using Python (pandas, matplotlib, seaborn, plotly).
- **Showcase** real‑world examples and pitfalls.
- **Provide** actionable best‑practice guidelines that can be applied immediately.
---
## 4.1 Why EDA Is Essential
| Dimension | Purpose | Typical Questions |
|-----------|---------|--------------------|
| **Understanding the data** | Discover what data you actually have | How many records? Which columns are missing? |
| **Identifying patterns** | Spot relationships and trends | Are sales increasing over time? |
| **Uncovering anomalies** | Detect outliers or data entry errors | Why does this transaction have an unusually high value? |
| **Hypothesis generation** | Formulate testable assumptions | Does higher advertising spend correlate with more conversions? |
EDA is not just a statistical exercise; it’s the *first conversation* between you and the dataset. A well‑executed EDA prevents wasted effort later in the pipeline and builds a strong foundation for modeling and communication.
---
## 4.2 Core Statistical Summaries
### 4.2.1 Univariate Statistics
Use **pandas** to compute descriptive statistics:
```python
import pandas as pd

df = pd.read_csv('sales.csv')
print(df.describe())
```
The `describe()` method provides
- Count, mean, standard deviation, min, max
- 25th, 50th, and 75th percentiles (quartiles)
> **Tip**: For categorical columns, use `value_counts()` and `unique()` to understand distribution.
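A minimal sketch of that categorical summary, using a toy frame in place of `sales.csv` (the `region` column is illustrative):

```python
import pandas as pd

# Toy data standing in for sales.csv; column and values are illustrative
df = pd.DataFrame({'region': ['North', 'South', 'North', 'West', 'North']})

# Frequency of each category, most common first
counts = df['region'].value_counts()
print(counts)

# Distinct categories and how many there are
print(df['region'].unique())
print(df['region'].nunique())
```

`value_counts()` sorts by frequency by default, which makes dominant and rare categories obvious at a glance.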
### 4.2.2 Missing Data Patterns
Visualize missingness with a heatmap or bar chart.
```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
plt.title('Missing Data Heatmap')
plt.show()
```
The heatmap immediately tells you which columns have gaps and whether missingness is random.
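To put numbers behind the heatmap, you can also compute the share of missing values per column; a small sketch with a toy frame standing in for `sales.csv`:

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps (stand-in for sales.csv)
df = pd.DataFrame({'revenue': [100.0, np.nan, 250.0, np.nan],
                   'region': ['North', 'South', None, 'West']})

# Fraction of missing values per column, worst first
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a high missing share may need imputation, a "missing" indicator feature, or outright removal.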
### 4.2.3 Outlier Detection
Outliers can distort model training.
```python
q1 = df['revenue'].quantile(0.25)
q3 = df['revenue'].quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr
outliers = df[(df['revenue'] > upper) | (df['revenue'] < lower)]
print('Number of outliers:', outliers.shape[0])
```
Visual tools like boxplots also help.
```python
sns.boxplot(x=df['revenue'])
plt.title('Revenue Boxplot')
plt.show()
```
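Once flagged, outliers do not always have to be dropped; one common alternative is to clip (winsorize) them to the IQR fences. A minimal sketch on a toy series with one extreme value:

```python
import pandas as pd

# Toy revenue with one extreme value
df = pd.DataFrame({'revenue': [10.0, 12.0, 11.0, 13.0, 500.0]})

q1, q3 = df['revenue'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Winsorize: clip extreme values to the fences instead of dropping rows
df['revenue_capped'] = df['revenue'].clip(lower, upper)
print(df)
```

Clipping preserves the row (and its other columns) while removing the leverage the extreme value would exert on means and regression fits.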
---
## 4.3 Distribution Visualizations
### 4.3.1 Histograms & Density Plots
```python
sns.histplot(df['revenue'], kde=True, bins=30)
plt.title('Revenue Distribution')
plt.xlabel('Revenue')
plt.ylabel('Frequency')
plt.show()
```
Use KDE (Kernel Density Estimate) to smooth the distribution.
### 4.3.2 Empirical Cumulative Distribution Function (ECDF)
```python
import statsmodels.api as sm

# ECDF returns a step function; plot its sorted x values against cumulative proportions
ecdf = sm.distributions.ECDF(df['revenue'])

fig, ax = plt.subplots()
ax.step(ecdf.x, ecdf.y, where='post')
ax.set_title('Revenue ECDF')
plt.show()
```
ECDFs are useful when comparing distributions across groups.
---
## 4.4 Bivariate Analysis
### 4.4.1 Scatter Plots
Identify linear or non‑linear relationships.
```python
sns.scatterplot(x='price', y='units_sold', data=df)
plt.title('Price vs Units Sold')
plt.show()
```
### 4.4.2 Correlation Matrix
```python
# numeric_only avoids errors when the frame contains text columns
corr = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
A correlation matrix highlights potential multicollinearity and informs feature selection.
### 4.4.3 Pair Plots
```python
sns.pairplot(df[['price', 'units_sold', 'revenue']])
plt.show()
```
Pair plots give a quick visual of pairwise relationships.
---
## 4.5 Multivariate Techniques
### 4.5.1 Principal Component Analysis (PCA)
Reduce dimensionality while preserving variance.
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = df.select_dtypes(include=['float64', 'int64'])
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled)

plt.scatter(principal_components[:, 0], principal_components[:, 1])
plt.title('PCA 2‑D Projection')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```
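Before trusting a 2‑D projection, check how much variance the two components actually retain via `explained_variance_ratio_`. A minimal sketch on synthetic features (standing in for the numeric sales columns), with one deliberately correlated pair:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features (stand-in for the sales columns)
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
# Make column 1 nearly a copy of column 0 so PC1 has something to capture
features[:, 1] = features[:, 0] * 2 + rng.normal(scale=0.1, size=200)

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2).fit(scaled)

# Share of total variance captured by each component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the two components explain only a small fraction of the variance, the scatter plot is a heavily compressed view and clusters in it should be interpreted cautiously.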
### 4.5.2 t‑SNE and UMAP
Both t‑SNE and UMAP capture highly non‑linear structure that PCA's linear projection can flatten; UMAP is typically faster and scales better to large datasets.
```python
import umap  # requires the umap-learn package

umap_reducer = umap.UMAP(n_components=2, random_state=42)
embedding = umap_reducer.fit_transform(scaled)

plt.scatter(embedding[:, 0], embedding[:, 1])
plt.title('UMAP Projection')
plt.show()
```
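For t‑SNE, scikit-learn's implementation follows the same fit/transform pattern; a minimal sketch on a random matrix standing in for the scaled feature matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the scaled feature matrix from the PCA example
rng = np.random.default_rng(42)
scaled = rng.normal(size=(100, 10))

# perplexity (roughly, the effective neighborhood size) must be < n_samples
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(scaled)
print(embedding.shape)
```

Note that t‑SNE preserves local neighborhoods but not global distances, so cluster sizes and inter-cluster gaps in the plot should not be over-interpreted.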
---
## 4.6 Interactive Dashboards
While static plots are great for notebooks, interactive dashboards accelerate stakeholder exploration.
| Tool | Strength | Typical Use Case |
|------|----------|------------------|
| **Plotly Dash** | Full‑stack Python, easy to embed in web apps | Real‑time sales monitoring |
| **Streamlit** | Rapid prototyping, minimal boilerplate | Sharing quick insights with business teams |
| **Tableau / Power BI** | Drag‑and‑drop, powerful visual analytics | Enterprise‑wide reporting |
### Example: Streamlit Dashboard
```python
# app.py
import streamlit as st
import pandas as pd
import plotly.express as px

df = pd.read_csv('sales.csv')
st.title('Sales Dashboard')

# Filters
region = st.sidebar.multiselect('Region', df['region'].unique(), default=df['region'].unique())
filtered = df[df['region'].isin(region)]

# Plot
fig = px.bar(filtered, x='month', y='revenue', color='product', barmode='group')
st.plotly_chart(fig)
```
Run with `streamlit run app.py`.
---
## 4.7 Best Practices & Pitfalls
| Practice | Why It Matters | How to Implement |
|----------|----------------|------------------|
| **Keep a reproducible notebook** | Enables audit and collaboration | Use Jupyter or Colab with clear code cells and markdown explanations |
| **Version‑control your data snapshots** | Avoid “data drift” during analysis | Store CSVs in a dedicated data folder and tag with dates |
| **Document assumptions** | Clarifies context for stakeholders | Add a “Data Assumptions” section in the notebook |
| **Avoid data leakage** | Prevents over‑optimistic results | Never let information from the future (or from the target) leak into predictive features |
| **Use consistent scales** | Easier comparison across plots | Use shared axis ranges and units, and standardize column names |
### Common Pitfalls
1. **Over‑plotting** – Too many points hide structure. Use transparency or hexbin plots.
2. **Misinterpreting correlation** – Correlation does not imply causation. Always corroborate with domain knowledge.
3. **Ignoring missingness** – Treat missing data as a signal, not noise.
4. **Skipping data checks** – Unchecked outliers can break models.
5. **Assuming normality** – Many datasets are skewed; use appropriate transformations.
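On the last pitfall: a log transform is the standard remedy for right-skewed quantities like revenue. A minimal sketch on synthetic lognormal data, with skewness computed by hand so the effect is measurable:

```python
import numpy as np

def skewness(x):
    # Sample skewness: third standardized moment
    x = np.asarray(x)
    m, s = x.mean(), x.std()
    return ((x - m) ** 3).mean() / s ** 3

# Heavily right-skewed toy revenue values
rng = np.random.default_rng(1)
revenue = rng.lognormal(mean=3, sigma=1, size=1000)

# log1p compresses the long right tail and handles zeros safely
log_revenue = np.log1p(revenue)

print(skewness(revenue), skewness(log_revenue))
```

After the transform the distribution is far closer to symmetric, which makes means, standard deviations, and linear models much better behaved.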
---
## 4.8 Summary
- EDA is the *bridge* between raw data and actionable insights.
- Combine descriptive statistics, visualizations, and interactive tools for a holistic view.
- Keep analyses reproducible and well‑documented to scale.
- Use multivariate techniques to uncover hidden structure.
- Always validate findings with domain expertise.
By mastering these EDA techniques, analysts set the stage for robust modeling, trustworthy insights, and meaningful business impact.
---
> *“The first look at data often tells more than any model ever will.”* – **墨羽行**