返回目錄
A
Data Science Mastery: From Fundamentals to Impactful Insights - 第 3 章
Chapter 3: Exploratory Data Analysis & Visualization
發布於 2026-02-28 20:42
# Chapter 3: Exploratory Data Analysis & Visualization
Exploratory Data Analysis (EDA) is the compass that guides every data‑science project. Before building a model or writing a report, you need to understand the data’s shape, its quirks, and the stories it can tell. This chapter walks you through the full EDA workflow, from data understanding to interactive dashboards, and shows how to turn raw tables from a PostgreSQL database into compelling visual narratives.
---
## 3.1 Data Understanding
### 3.1.1 Schema Overview
When you finish the cleaning phase in Chapter 2, your PostgreSQL database should look something like this:
| Table | Columns (example) | Data Types |
|---------|--------------------|------------|
| **orders** | order_id, customer_id, order_date, total_amount | INT, INT, TIMESTAMP, DECIMAL |
| **products** | product_id, category, price, stock | INT, TEXT, DECIMAL, INT |
| **order_items** | item_id, order_id, product_id, quantity | INT, INT, INT, INT |
> **Why it matters** – Knowing the schema helps you write efficient SQL joins and avoid costly data‑type mismatches.
### 3.1.2 Data Connectivity
python
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine(
"postgresql+psycopg2://user:password@localhost:5432/ecommerce"
)
orders = pd.read_sql("SELECT * FROM orders", engine)
products = pd.read_sql("SELECT * FROM products", engine)
order_items = pd.read_sql("SELECT * FROM order_items", engine)
> **Tip** – Keep a `README.md` in your project folder that records the database URI, credentials (use environment variables), and the SQL schema. This becomes invaluable when you re‑run analyses or hand off the project to a teammate.
---
## 3.2 Summary Statistics
| Measure | Function | Example |
|---------|----------|---------|
| **Count** | `df.shape[0]` | 1,024,000 rows |
| **Mean** | `df.mean()` | 59.73 (average order total) |
| **Std** | `df.std()` | 42.15 |
| **Min / Max** | `df.min()`, `df.max()` | 1.00 / 999.99 |
| **Quantiles** | `df.quantile([0.25, 0.5, 0.75])` | 25th / 50th / 75th percentiles |
### 3.2.1 Data Types & Nulls
python
# Detect missing values
missing_summary = df.isna().sum().sort_values(ascending=False)
print(missing_summary)
### 3.2.2 Feature‑Level Descriptions
python
desc = df.describe(include='all')
print(desc.head())
> **Why it matters** – Summary stats surface anomalies (e.g., a negative `price`) early, and give you a baseline for visual checks.
---
## 3.3 Distribution Visualizations
| Distribution | Plot | Library |
|--------------|------|---------|
| **Histogram** | `plt.hist()` | Matplotlib |
| **Kernel Density** | `sns.kdeplot()` | Seaborn |
| **Boxplot** | `sns.boxplot()` | Seaborn |
| **Bar Chart** | `plt.bar()` | Matplotlib |
### 3.3.1 Example: Order Total
python
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 4))
sns.histplot(orders['total_amount'], bins=50, kde=True)
plt.title('Distribution of Order Totals')
plt.xlabel('Total Amount')
plt.ylabel('Frequency')
plt.show()
> **Interpretation** – A right‑skewed histogram signals a small number of high‑value orders, which may be worth targeted marketing.
---
## 3.4 Relationship Analysis
### 3.4.1 Correlation Matrix
python
corr = orders.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature Correlation')
plt.show()
### 3.4.2 Pair Plots
python
sns.pairplot(orders[['total_amount', 'price', 'quantity']])
plt.show()
### 3.4.3 Group‑by Aggregations
python
grouped = orders.groupby('category')['total_amount'].agg(['mean', 'sum', 'count']).reset_index()
print(grouped.head())
> **Why it matters** – Correlations help you decide which features to keep, drop, or engineer, and group‑by stats reveal which product categories drive revenue.
---
## 3.5 Missingness & Outlier Detection
| Technique | Purpose | Example |
|-----------|---------|---------|
| **Heatmap** | Visual missingness pattern | `sns.heatmap(df.isna(), cbar=False)` |
| **Isolation Forest** | Detect anomalies | `from sklearn.ensemble import IsolationForest` |
| **Z‑score** | Simple outlier threshold | `np.abs(stats.zscore(df['price'])) > 3` |
### 3.5.1 Missingness Heatmap
python
plt.figure(figsize=(12, 6))
sns.heatmap(df.isna(), cbar=False)
plt.title('Missing Data Heatmap')
plt.show()
### 3.5.2 Isolation Forest
python
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.01, random_state=42)
pred = iso.fit_predict(df[['price', 'quantity']])
df['anomaly'] = pred
print(df[df['anomaly'] == -1].head())
> **Tip** – Document every anomaly flag; sometimes what looks like an outlier is a valid edge case (e.g., a wholesale order).
---
## 3.6 Temporal Patterns
| Time Feature | How to Extract | Library |
|--------------|----------------|---------|
| **Month** | `df['order_date'].dt.month` | Pandas |
| **Day of Week** | `df['order_date'].dt.dayofweek` | Pandas |
| **Hour** | `df['order_date'].dt.hour` | Pandas |
### 3.6.1 Sales by Month
python
orders['month'] = orders['order_date'].dt.to_period('M')
monthly_sales = orders.groupby('month')['total_amount'].sum()
monthly_sales.plot(kind='bar', figsize=(10, 4))
plt.title('Monthly Sales')
plt.ylabel('Revenue')
plt.show()
> **Why it matters** – Seasonal trends inform inventory planning and promotional calendars.
---
## 3.7 Dimensionality Reduction & Clustering
### 3.7.1 Principal Component Analysis (PCA)
python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['price', 'quantity', 'stock']])
pca = PCA(n_components=2)
principal = pca.fit_transform(scaled)
plt.scatter(principal[:, 0], principal[:, 1])
plt.title('PCA of Product Features')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
### 3.7.2 K‑Means Clustering
python
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(principal)
plt.scatter(principal[:, 0], principal[:, 1], c=clusters, cmap='viridis')
plt.title('K‑Means Product Clusters')
plt.show()
> **Case Study** – In Chapter 4 you’ll see how these clusters translate into targeted recommendation engines.
---
## 3.8 Interactive Dashboards
### 3.8.1 Plotly Express
python
import plotly.express as px
fig = px.scatter(df, x='price', y='total_amount', color='category', hover_data=['product_id'])
fig.update_layout(title='Price vs. Order Total by Category')
fig.show()
### 3.8.2 Streamlit
python
import streamlit as st
st.title('E‑Commerce Dashboard')
st.write('Select a Category')
category = st.selectbox('Category', df['category'].unique())
filtered = df[df['category'] == category]
st.line_chart(filtered.groupby('month')['total_amount'].sum())
> **Why it matters** – Dashboards let stakeholders interact with the data, validating insights before you build a model.
---
## 3.9 Best Practices & Reproducibility
| Practice | Description |
|----------|-------------|
| **Version Control** | Use Git for notebooks, scripts, and Jupyter kernels. |
| **Environment Pinning** | `requirements.txt` or `environment.yml` for deterministic builds. |
| **Data Lineage** | Track the SQL query, timestamp, and transformations applied. |
| **Automated EDA Reports** | Use libraries like `pandas-profiling` or `sweetviz` to generate one‑page summaries. |
| **Documentation** | Inline comments, Markdown cells, and a `docs/` folder. |
python
# Example: pandas‑profiling report
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title='EDA Profile', explorative=True)
profile.to_file("eda_report.html")
---
## 3.10 Case Study: E‑Commerce Product Clusters
### 3.10.1 Context
After cleaning the PostgreSQL data, you engineered the following product‑level features:
- `category_frequency` – how often a product appears across all orders.
- `average_order_value` – mean revenue per order containing the product.
- `price_to_stock_ratio` – price divided by available stock.
### 3.10.2 Clustering Results
| Cluster | Avg Category Frequency | Avg Order Value | Avg Price/Stock | Primary Product Types |
|---------|------------------------|-----------------|-----------------|-----------------------|
| 0 | 12.4 | $45.67 | 0.07 | Electronics, Home Appliances |
| 1 | 3.8 | $12.10 | 0.15 | Apparel, Accessories |
| 2 | 18.9 | $78.23 | 0.05 | Luxury Goods |
| 3 | 1.5 | $4.87 | 0.30 | Discounted Items |
> **Recommendation** – Launch a **Tiered Loyalty Program** where Cluster 0 customers receive early access to new gadgets, Cluster 1 customers get style‑inspiration newsletters, and Cluster 3 gets flash‑sale alerts. This aligns product affinity with marketing spend.
---
### Takeaway
EDA is the bedrock of any data‑science endeavor. It transforms raw tables into actionable stories, uncovers hidden patterns, and safeguards your models from garbage‑in‑garbage‑out pitfalls. Mastering the techniques in this chapter will enable you to navigate any dataset with confidence and craft visual narratives that resonate with stakeholders.
---