Data Science for Strategic Decision-Making: A Practical Guide - Chapter 3
Published 2026-03-03 18:12
# Chapter 3: Exploratory Data Analysis & Visualization
> **Goal:** Turn a clean, engineered dataset into actionable insights that shape the next strategic decision.
## 3.1 Why EDA Matters
- **Bridge the gap** between raw data and models.
- **Validate assumptions** (e.g., normality, independence).
- **Uncover patterns, outliers, and missing‑value structures** that can influence downstream modeling.
- **Guide feature engineering** by revealing relationships that can be captured.
In strategic decision‑making, the *story* you tell with EDA determines where the organization focuses its resources.
## 3.2 Core Components of an EDA Workflow
| Step | Purpose | Typical Techniques | Example Tools |
|------|---------|-------------------|---------------|
| 1. Data Overview | Understand the shape and scope | `shape`, `head`, `info` | Pandas, Polars |
| 2. Data Quality Check | Detect missing, duplicate, inconsistent values | `isnull`, `duplicated`, `describe` | Pandas, Dask |
| 3. Univariate Analysis | Profile each feature | Histogram, Boxplot, KDE | Matplotlib, Seaborn |
| 4. Bivariate Analysis | Examine pairwise relationships | Scatter, Pairplot, Correlation heatmap | Seaborn, Plotly |
| 5. Multivariate Analysis | Explore high‑dimensional patterns | PCA, t‑SNE, UMAP | Scikit‑learn, umap‑learn |
| 6. Outlier & Anomaly Detection | Flag unusual observations | IQR, Z‑score, Isolation Forest | SciPy, PyOD |
| 7. Visualization Storytelling | Communicate findings | Dashboard, Interactive plots | Tableau, Power BI, Dash |
### 3.2.1 Data Overview
```python
import pandas as pd

df = pd.read_csv("sales_data.csv")
print(df.shape)   # (n_rows, n_columns)
print(df.head())  # first five rows
df.info()         # dtypes and non-null counts; prints directly, returns None
```
### 3.2.2 Data Quality Check
```python
# Missingness (fraction of missing values per column)
missing = df.isnull().mean().sort_values(ascending=False)
print("Missing fraction per column:\n", missing)

# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
```
> **Tip:** Visualize missingness with `missingno.matrix(df)` to quickly spot patterns.
## 3.3 Univariate Analysis
| Variable Type | Focus | Typical Plot | Interpretation |
|---------------|-------|--------------|----------------|
| Continuous | Skewness, kurtosis, spread | Histogram, KDE, Boxplot | Identify heavy tails, outliers |
| Categorical | Category frequencies | Bar chart, Count plot | Detect dominant or rare categories |
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Continuous variable
sns.histplot(df["revenue"], kde=True)
plt.title("Revenue Distribution")
plt.show()

# Categorical variable
sns.countplot(x="region", data=df)
plt.title("Customer Distribution by Region")
plt.show()
```
## 3.4 Bivariate Analysis
Correlation metrics: Pearson (linear), Spearman (rank), Kendall (ordinal). Visualize with heatmaps and pairplots.
```python
# Pearson by default; pass method="spearman" or "kendall" for rank-based metrics.
# numeric_only=True skips non-numeric columns, which would otherwise raise an error.
corr = df.corr(method="pearson", numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
```
> **Strategic Insight Example:** A strong correlation between `price` and `sales_volume` (typically negative for price‑elastic products) can quantify price sensitivity and inform pricing strategy.
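As an illustrative sketch (not the book's dataset), elasticity can be estimated as the slope of a log‑log regression of volume on price; the synthetic `price` and `volume` arrays below are assumptions standing in for real columns:

```python
# Illustrative elasticity estimate: slope of log(volume) vs log(price).
import numpy as np

rng = rng = np.random.default_rng(1)
price = rng.uniform(10, 100, size=300)
# Synthetic demand curve with true elasticity of about -1.2
volume = 5000 * price ** -1.2 * rng.lognormal(sigma=0.05, size=300)

elasticity, _ = np.polyfit(np.log(price), np.log(volume), 1)
print(f"Estimated price elasticity: {elasticity:.2f}")  # close to -1.2
```

An elasticity below -1 would indicate that price cuts more than pay for themselves in volume.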
## 3.5 Multivariate Analysis
Dimensionality reduction can reveal hidden structure.
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale features first (PCA is sensitive to variance) and keep the
# target out of the feature matrix to avoid leakage into the projection.
X = StandardScaler().fit_transform(
    df.select_dtypes(include=["number"]).drop(columns=["target"])
)
pca = PCA(n_components=2)
components = pca.fit_transform(X)

plt.scatter(components[:, 0], components[:, 1], c=df["target"], cmap="viridis")
plt.title("PCA Projection")
plt.show()
```
> **Decision‑Making Hook:** If two principal components capture >70% variance, a simpler model using these components may be sufficient, reducing computational cost.
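The >70% rule above can be checked with `explained_variance_ratio_`; this sketch uses synthetic correlated features as a stand‑in for the real numeric matrix:

```python
# Hedged sketch of the ">70% variance" check on synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)   # correlated feature
X[:, 2] = -0.8 * X[:, 0] + rng.normal(scale=0.1, size=200)  # another correlated feature

pca = PCA(n_components=2).fit(X)
cum_var = float(pca.explained_variance_ratio_.cumsum()[-1])
print(f"Variance captured by 2 components: {cum_var:.1%}")
keep_simple_model = cum_var > 0.70  # decision rule from the text
```

Because three of the four features are nearly collinear here, two components capture most of the variance, triggering the "use a simpler model" rule.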
## 3.6 Outlier & Anomaly Detection
Use statistical rules or ML‑based methods.
```python
# IQR method on numeric columns only
num = df.select_dtypes(include=["number"])
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1

# Flag rows where any feature falls outside the 1.5 * IQR fences.
# (Element-wise comparison alone returns a same-shape frame, not a row count.)
mask = ((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)
outliers = df[mask]
print("Number of outlier rows:", outliers.shape[0])
```
ML approach:
```python
from pyod.models.iforest import IForest

X = df.select_dtypes(include=["number"])
iforest = IForest()
iforest.fit(X)
anomalies = iforest.labels_  # 0 = normal, 1 = anomaly (labels from the fit data)
print(f"Detected {anomalies.sum()} anomalies")
```
## 3.7 Interactive Dashboards
Tools:
- **Python**: Dash, Streamlit, Bokeh
- **JavaScript**: D3.js, Vega
- **BI**: Tableau, Power BI
### 3.7.1 Quick Dash Example
```python
import dash
from dash import html, dcc
import plotly.express as px

app = dash.Dash(__name__)
fig = px.histogram(df, x="revenue", nbins=30)

app.layout = html.Div([
    html.H1("Revenue Distribution Dashboard"),
    dcc.Graph(figure=fig),
])

if __name__ == "__main__":
    app.run(debug=True)  # older Dash versions use app.run_server(debug=True)
```
> **Strategic Note:** Interactive dashboards enable executives to drill down into metrics without requiring a data scientist.
## 3.8 Reproducibility Checklist
| Item | Action | Tool |
|------|--------|------|
| Code version | Commit to Git | Git, GitHub |
| Environment | Conda env / Pipfile | Conda, Pipenv |
| Data lineage | Log source, timestamps | Airflow, dbt |
| EDA notebooks | Mark as `readonly` in shared repo | Jupyter, VS Code |
| Visuals | Save as PNG/HTML | Matplotlib, Plotly |
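For the **Visuals** row, a minimal sketch of persisting a Matplotlib figure as a PNG; the data and filename are illustrative:

```python
# Save a figure to disk so the visual survives outside the notebook.
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. in CI
import matplotlib.pyplot as plt
import os
import tempfile

fig, ax = plt.subplots()
ax.hist([1, 2, 2, 3, 3, 3], bins=3)
ax.set_title("Revenue Distribution")

out_path = os.path.join(tempfile.gettempdir(), "revenue_distribution.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```

Plotly figures can analogously be exported to standalone HTML for sharing interactive versions.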
## 3.9 Case Study: E-commerce Sales Analysis
1. **Objective:** Identify factors driving conversion rates.
2. **Dataset:** 500k transactions, features: `customer_id`, `session_time`, `num_items`, `price_total`, `device`, `region`, `converted`.
3. **Steps:**
- Cleaned missing `device` values.
- Calculated `conversion_rate = sum(converted)/n_sessions` per region.
- Visualized `session_time` vs `conversion_rate` with a scatter plot, adding a regression line.
- Performed chi‑square test on `device` vs `converted`.
4. **Findings:** Mobile users had a 12% higher conversion rate; sessions longer than 5 min increased conversions by 8%.
5. **Strategic Decision:** Allocate budget to mobile‑optimized ads and incentivize longer sessions via gamification.
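The chi‑square test in step 3 can be sketched as follows; the contingency counts below are illustrative stand‑ins, not the real 500k‑transaction dataset:

```python
# Chi-square test of independence between device type and conversion.
import pandas as pd
from scipy.stats import chi2_contingency

# Rows: device, columns: converted (no / yes) -- illustrative counts
table = pd.DataFrame(
    {"no": [420, 300], "yes": [80, 110]},
    index=["desktop", "mobile"],
)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
# A small p-value suggests conversion rate depends on device type.
```

Here mobile's higher conversion share yields a significant result, consistent with the finding in step 4.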
## 3.10 Exercises
1. **Missingness Pattern:** Using `missingno` library, plot the missing data matrix for the sales dataset. Identify columns with >20% missingness and propose imputation strategies.
2. **Correlation Threshold:** Compute Pearson correlations between all numeric features. Flag pairs with |r| > 0.8 and discuss potential multicollinearity concerns.
3. **Outlier Impact:** Remove outliers identified via IQR on `revenue` and compare mean and median revenue before and after.
4. **Dashboard:** Build a Streamlit dashboard that allows filtering by `region` and visualizes `sales_volume` over time.
## 3.11 Summary
Exploratory Data Analysis is the compass that guides strategic data science initiatives. By systematically summarizing data, visualizing relationships, and ensuring reproducibility, you lay a solid foundation for modeling and decision‑support. The next chapter will build on these insights to introduce statistical inference and predictive modeling, turning descriptive patterns into prescriptive actions.