Data Science for Strategic Decision-Making: A Practical Guide - Chapter 3
Published 2026-03-03 18:12
# Chapter 3: Exploratory Data Analysis & Visualization
> **Goal:** Turn a clean, engineered dataset into actionable insights that shape the next strategic decision.
## 3.1 Why EDA Matters
- **Bridge the gap** between raw data and models.
- **Validate assumptions** (e.g., normality, independence).
- **Uncover patterns, outliers, and missing‑value structures** that can influence downstream modeling.
- **Guide feature engineering** by revealing relationships that can be captured.
In strategic decision‑making, the *story* you tell with EDA determines where the organization focuses its resources.
## 3.2 Core Components of an EDA Workflow
| Step | Purpose | Typical Techniques | Example Tools |
|------|---------|-------------------|---------------|
| 1. Data Overview | Understand the shape and scope | `shape`, `head`, `info` | Pandas, Polars |
| 2. Data Quality Check | Detect missing, duplicate, inconsistent values | `isnull`, `duplicated`, `describe` | Pandas, Dask |
| 3. Univariate Analysis | Profile each feature | Histogram, Boxplot, KDE | Matplotlib, Seaborn |
| 4. Bivariate Analysis | Examine pairwise relationships | Scatter, Pairplot, Correlation heatmap | Seaborn, Plotly |
| 5. Multivariate Analysis | Explore high‑dimensional patterns | PCA, t‑SNE, UMAP | Scikit‑learn, umap‑learn |
| 6. Outlier & Anomaly Detection | Flag unusual observations | IQR, Z‑score, Isolation Forest | SciPy, PyOD |
| 7. Visualization Storytelling | Communicate findings | Dashboard, Interactive plots | Tableau, Power BI, Dash |
### 3.2.1 Data Overview
```python
import pandas as pd

df = pd.read_csv("sales_data.csv")
print(df.shape)   # (n_rows, n_columns)
print(df.head())  # first five rows
df.info()         # dtypes and non-null counts; prints directly, returns None
```
### 3.2.2 Data Quality Check
```python
# Missingness (fraction of missing values per column)
missing = df.isnull().mean().sort_values(ascending=False)
print("Missing fraction per column:\n", missing)

# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
```
> **Tip:** Visualize missingness with `missingno.matrix(df)` to quickly spot patterns.
## 3.3 Univariate Analysis
| Variable Type | Focus | Typical Plot | Interpretation |
|---------------|-------|--------------|----------------|
| Continuous | Skewness, kurtosis, spread | Histogram, KDE, Boxplot | Identify heavy tails, outliers |
| Categorical | Category frequencies | Bar chart, Count plot | Detect dominant or rare categories |
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Continuous variable
sns.histplot(df["revenue"], kde=True)
plt.title("Revenue Distribution")
plt.show()

# Categorical variable
sns.countplot(x="region", data=df)
plt.title("Customer Distribution by Region")
plt.show()
```
## 3.4 Bivariate Analysis
Correlation metrics: Pearson (linear), Spearman (rank), Kendall (ordinal). Visualize with heatmaps and pairplots.
```python
# Pearson by default; pass method="spearman" or "kendall" for rank-based metrics.
# numeric_only=True skips non-numeric columns, which would otherwise raise an error.
corr = df.corr(method="pearson", numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
```
> **Strategic Insight Example:** A strong correlation between `price` and `sales_volume` (typically negative for price‑elastic products) can quantify price sensitivity and inform pricing strategy.
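As an illustrative sketch (not the book's dataset), elasticity can be estimated as the slope of a log‑log regression of volume on price; the synthetic `price` and `volume` arrays below are assumptions standing in for real columns:

```python
# Illustrative elasticity estimate: slope of log(volume) vs log(price).
import numpy as np

rng = rng = np.random.default_rng(1)
price = rng.uniform(10, 100, size=300)
# Synthetic demand curve with true elasticity of about -1.2
volume = 5000 * price ** -1.2 * rng.lognormal(sigma=0.05, size=300)

elasticity, _ = np.polyfit(np.log(price), np.log(volume), 1)
print(f"Estimated price elasticity: {elasticity:.2f}")  # close to -1.2
```

An elasticity below -1 would indicate that price cuts more than pay for themselves in volume.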
## 3.5 Multivariate Analysis
Dimensionality reduction can reveal hidden structure.
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale features first (PCA is sensitive to variance) and keep the
# target out of the feature matrix to avoid leakage into the projection.
X = StandardScaler().fit_transform(
    df.select_dtypes(include=["number"]).drop(columns=["target"])
)
pca = PCA(n_components=2)
components = pca.fit_transform(X)

plt.scatter(components[:, 0], components[:, 1], c=df["target"], cmap="viridis")
plt.title("PCA Projection")
plt.show()
```
> **Decision‑Making Hook:** If two principal components capture >70% variance, a simpler model using these components may be sufficient, reducing computational cost.
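The >70% rule above can be checked with `explained_variance_ratio_`; this sketch uses synthetic correlated features as a stand‑in for the real numeric matrix:

```python
# Hedged sketch of the ">70% variance" check on synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)   # correlated feature
X[:, 2] = -0.8 * X[:, 0] + rng.normal(scale=0.1, size=200)  # another correlated feature

pca = PCA(n_components=2).fit(X)
cum_var = float(pca.explained_variance_ratio_.cumsum()[-1])
print(f"Variance captured by 2 components: {cum_var:.1%}")
keep_simple_model = cum_var > 0.70  # decision rule from the text
```

Because three of the four features are nearly collinear here, two components capture most of the variance, triggering the "use a simpler model" rule.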
## 3.6 Outlier & Anomaly Detection
Use statistical rules or ML‑based methods.
```python
# IQR method on numeric columns only
num = df.select_dtypes(include=["number"])
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1

# Flag rows where any feature falls outside the 1.5 * IQR fences.
# (Element-wise comparison alone returns a same-shape frame, not a row count.)
mask = ((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)
outliers = df[mask]
print("Number of outlier rows:", outliers.shape[0])
```
ML approach:
```python
from pyod.models.iforest import IForest

X = df.select_dtypes(include=["number"])
iforest = IForest()
iforest.fit(X)
anomalies = iforest.labels_  # 0 = normal, 1 = anomaly (labels from the fit data)
print(f"Detected {anomalies.sum()} anomalies")
```
## 3.7 Interactive Dashboards
Tools:
- **Python**: Dash, Streamlit, Bokeh
- **JavaScript**: D3.js, Vega
- **BI**: Tableau, Power BI
### 3.7.1 Quick Dash Example
```python
import dash
from dash import html, dcc
import plotly.express as px

app = dash.Dash(__name__)
fig = px.histogram(df, x="revenue", nbins=30)

app.layout = html.Div([
    html.H1("Revenue Distribution Dashboard"),
    dcc.Graph(figure=fig),
])

if __name__ == "__main__":
    app.run(debug=True)  # older Dash versions use app.run_server(debug=True)
```
> **Strategic Note:** Interactive dashboards enable executives to drill down into metrics without requiring a data scientist.
## 3.8 Reproducibility Checklist
| Item | Action | Tool |
|------|--------|------|
| Code version | Commit to Git | Git, GitHub |
| Environment | Conda env / Pipfile | Conda, Pipenv |
| Data lineage | Log source, timestamps | Airflow, dbt |
| EDA notebooks | Mark as `readonly` in shared repo | Jupyter, VS Code |
| Visuals | Save as PNG/HTML | Matplotlib, Plotly |
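For the **Visuals** row, a minimal sketch of persisting a Matplotlib figure as a PNG; the data and filename are illustrative:

```python
# Save a figure to disk so the visual survives outside the notebook.
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. in CI
import matplotlib.pyplot as plt
import os
import tempfile

fig, ax = plt.subplots()
ax.hist([1, 2, 2, 3, 3, 3], bins=3)
ax.set_title("Revenue Distribution")

out_path = os.path.join(tempfile.gettempdir(), "revenue_distribution.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```

Plotly figures can analogously be exported to standalone HTML for sharing interactive versions.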
## 3.9 Case Study: E-commerce Sales Analysis
1. **Objective:** Identify factors driving conversion rates.
2. **Dataset:** 500k transactions, features: `customer_id`, `session_time`, `num_items`, `price_total`, `device`, `region`, `converted`.
3. **Steps:**
- Cleaned missing `device` values.
- Calculated `conversion_rate = sum(converted)/n_sessions` per region.
- Visualized `session_time` vs `conversion_rate` with a scatter plot, adding a regression line.
- Performed chi‑square test on `device` vs `converted`.
4. **Findings:** Mobile users had a 12% higher conversion rate; sessions longer than 5 min increased conversions by 8%.
5. **Strategic Decision:** Allocate budget to mobile‑optimized ads and incentivize longer sessions via gamification.
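The chi‑square test in step 3 can be sketched as follows; the contingency counts below are illustrative stand‑ins, not the real 500k‑transaction dataset:

```python
# Chi-square test of independence between device type and conversion.
import pandas as pd
from scipy.stats import chi2_contingency

# Rows: device, columns: converted (no / yes) -- illustrative counts
table = pd.DataFrame(
    {"no": [420, 300], "yes": [80, 110]},
    index=["desktop", "mobile"],
)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
# A small p-value suggests conversion rate depends on device type.
```

Here mobile's higher conversion share yields a significant result, consistent with the finding in step 4.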
## 3.10 Exercises
1. **Missingness Pattern:** Using `missingno` library, plot the missing data matrix for the sales dataset. Identify columns with >20% missingness and propose imputation strategies.
2. **Correlation Threshold:** Compute Pearson correlations between all numeric features. Flag pairs with |r| > 0.8 and discuss potential multicollinearity concerns.
3. **Outlier Impact:** Remove outliers identified via IQR on `revenue` and compare mean and median revenue before and after.
4. **Dashboard:** Build a Streamlit dashboard that allows filtering by `region` and visualizes `sales_volume` over time.
## 3.11 Summary
Exploratory Data Analysis is the compass that guides strategic data science initiatives. By systematically summarizing data, visualizing relationships, and ensuring reproducibility, you lay a solid foundation for modeling and decision‑support. The next chapter will build on these insights to introduce statistical inference and predictive modeling, turning descriptive patterns into prescriptive actions.