Chapter 3: Exploratory Data Analysis & Visualization

發布於 2026-02-28 20:42

# Chapter 3: Exploratory Data Analysis & Visualization Exploratory Data Analysis (EDA) is the compass that guides every data‑science project. Before building a model or writing a report, you need to understand the data’s shape, its quirks, and the stories it can tell. This chapter walks you through the full EDA workflow, from data understanding to interactive dashboards, and shows how to turn raw tables from a PostgreSQL database into compelling visual narratives. --- ## 3.1 Data Understanding ### 3.1.1 Schema Overview When you finish the cleaning phase in Chapter 2, your PostgreSQL database should look something like this: | Table | Columns (example) | Data Types | |---------|--------------------|------------| | **orders** | order_id, customer_id, order_date, total_amount | INT, INT, TIMESTAMP, DECIMAL | | **products** | product_id, category, price, stock | INT, TEXT, DECIMAL, INT | | **order_items** | item_id, order_id, product_id, quantity | INT, INT, INT, INT | > **Why it matters** – Knowing the schema helps you write efficient SQL joins and avoid costly data‑type mismatches. ### 3.1.2 Data Connectivity python import pandas as pd from sqlalchemy import create_engine engine = create_engine( "postgresql+psycopg2://user:password@localhost:5432/ecommerce" ) orders = pd.read_sql("SELECT * FROM orders", engine) products = pd.read_sql("SELECT * FROM products", engine) order_items = pd.read_sql("SELECT * FROM order_items", engine) > **Tip** – Keep a `README.md` in your project folder that records the database URI, credentials (use environment variables), and the SQL schema. This becomes invaluable when you re‑run analyses or hand off the project to a teammate. --- ## 3.2 Summary Statistics | Measure | Function | Example | |---------|----------|---------| | **Count** | `df.shape[0]` | 1,024,000 rows | | **Mean** | `df.mean()` | 59.73 (average order total) | | **Std** | `df.std()` | 42.15 | | **Min / Max** | `df.min()`, `df.max()` | 1.00 / 999.99 | | **Quantiles** | `df.quantile([0.25, 0.5, 0.75])` | 25th / 50th / 75th percentiles | ### 3.2.1 Data Types & Nulls python # Detect missing values missing_summary = df.isna().sum().sort_values(ascending=False) print(missing_summary) ### 3.2.2 Feature‑Level Descriptions python desc = df.describe(include='all') print(desc.head()) > **Why it matters** – Summary stats surface anomalies (e.g., a negative `price`) early, and give you a baseline for visual checks. --- ## 3.3 Distribution Visualizations | Distribution | Plot | Library | |--------------|------|---------| | **Histogram** | `plt.hist()` | Matplotlib | | **Kernel Density** | `sns.kdeplot()` | Seaborn | | **Boxplot** | `sns.boxplot()` | Seaborn | | **Bar Chart** | `plt.bar()` | Matplotlib | ### 3.3.1 Example: Order Total python import matplotlib.pyplot as plt import seaborn as sns plt.figure(figsize=(10, 4)) sns.histplot(orders['total_amount'], bins=50, kde=True) plt.title('Distribution of Order Totals') plt.xlabel('Total Amount') plt.ylabel('Frequency') plt.show() > **Interpretation** – A right‑skewed histogram signals a small number of high‑value orders, which may be worth targeted marketing. --- ## 3.4 Relationship Analysis ### 3.4.1 Correlation Matrix python corr = orders.corr() plt.figure(figsize=(8, 6)) sns.heatmap(corr, annot=True, cmap='coolwarm') plt.title('Feature Correlation') plt.show() ### 3.4.2 Pair Plots python sns.pairplot(orders[['total_amount', 'price', 'quantity']]) plt.show() ### 3.4.3 Group‑by Aggregations python grouped = orders.groupby('category')['total_amount'].agg(['mean', 'sum', 'count']).reset_index() print(grouped.head()) > **Why it matters** – Correlations help you decide which features to keep, drop, or engineer, and group‑by stats reveal which product categories drive revenue. --- ## 3.5 Missingness & Outlier Detection | Technique | Purpose | Example | |-----------|---------|---------| | **Heatmap** | Visual missingness pattern | `sns.heatmap(df.isna(), cbar=False)` | | **Isolation Forest** | Detect anomalies | `from sklearn.ensemble import IsolationForest` | | **Z‑score** | Simple outlier threshold | `np.abs(stats.zscore(df['price'])) > 3` | ### 3.5.1 Missingness Heatmap python plt.figure(figsize=(12, 6)) sns.heatmap(df.isna(), cbar=False) plt.title('Missing Data Heatmap') plt.show() ### 3.5.2 Isolation Forest python from sklearn.ensemble import IsolationForest iso = IsolationForest(contamination=0.01, random_state=42) pred = iso.fit_predict(df[['price', 'quantity']]) df['anomaly'] = pred print(df[df['anomaly'] == -1].head()) > **Tip** – Document every anomaly flag; sometimes what looks like an outlier is a valid edge case (e.g., a wholesale order). --- ## 3.6 Temporal Patterns | Time Feature | How to Extract | Library | |--------------|----------------|---------| | **Month** | `df['order_date'].dt.month` | Pandas | | **Day of Week** | `df['order_date'].dt.dayofweek` | Pandas | | **Hour** | `df['order_date'].dt.hour` | Pandas | ### 3.6.1 Sales by Month python orders['month'] = orders['order_date'].dt.to_period('M') monthly_sales = orders.groupby('month')['total_amount'].sum() monthly_sales.plot(kind='bar', figsize=(10, 4)) plt.title('Monthly Sales') plt.ylabel('Revenue') plt.show() > **Why it matters** – Seasonal trends inform inventory planning and promotional calendars. --- ## 3.7 Dimensionality Reduction & Clustering ### 3.7.1 Principal Component Analysis (PCA) python from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaled = scaler.fit_transform(df[['price', 'quantity', 'stock']]) pca = PCA(n_components=2) principal = pca.fit_transform(scaled) plt.scatter(principal[:, 0], principal[:, 1]) plt.title('PCA of Product Features') plt.xlabel('PC1') plt.ylabel('PC2') plt.show() ### 3.7.2 K‑Means Clustering python from sklearn.cluster import KMeans kmeans = KMeans(n_clusters=4, random_state=42) clusters = kmeans.fit_predict(principal) plt.scatter(principal[:, 0], principal[:, 1], c=clusters, cmap='viridis') plt.title('K‑Means Product Clusters') plt.show() > **Case Study** – In Chapter 4 you’ll see how these clusters translate into targeted recommendation engines. --- ## 3.8 Interactive Dashboards ### 3.8.1 Plotly Express python import plotly.express as px fig = px.scatter(df, x='price', y='total_amount', color='category', hover_data=['product_id']) fig.update_layout(title='Price vs. Order Total by Category') fig.show() ### 3.8.2 Streamlit python import streamlit as st st.title('E‑Commerce Dashboard') st.write('Select a Category') category = st.selectbox('Category', df['category'].unique()) filtered = df[df['category'] == category] st.line_chart(filtered.groupby('month')['total_amount'].sum()) > **Why it matters** – Dashboards let stakeholders interact with the data, validating insights before you build a model. --- ## 3.9 Best Practices & Reproducibility | Practice | Description | |----------|-------------| | **Version Control** | Use Git for notebooks, scripts, and Jupyter kernels. | | **Environment Pinning** | `requirements.txt` or `environment.yml` for deterministic builds. | | **Data Lineage** | Track the SQL query, timestamp, and transformations applied. | | **Automated EDA Reports** | Use libraries like `pandas-profiling` or `sweetviz` to generate one‑page summaries. | | **Documentation** | Inline comments, Markdown cells, and a `docs/` folder. | python # Example: pandas‑profiling report from pandas_profiling import ProfileReport profile = ProfileReport(df, title='EDA Profile', explorative=True) profile.to_file("eda_report.html") --- ## 3.10 Case Study: E‑Commerce Product Clusters ### 3.10.1 Context After cleaning the PostgreSQL data, you engineered the following product‑level features: - `category_frequency` – how often a product appears across all orders. - `average_order_value` – mean revenue per order containing the product. - `price_to_stock_ratio` – price divided by available stock. ### 3.10.2 Clustering Results | Cluster | Avg Category Frequency | Avg Order Value | Avg Price/Stock | Primary Product Types | |---------|------------------------|-----------------|-----------------|-----------------------| | 0 | 12.4 | $45.67 | 0.07 | Electronics, Home Appliances | | 1 | 3.8 | $12.10 | 0.15 | Apparel, Accessories | | 2 | 18.9 | $78.23 | 0.05 | Luxury Goods | | 3 | 1.5 | $4.87 | 0.30 | Discounted Items | > **Recommendation** – Launch a **Tiered Loyalty Program** where Cluster 0 customers receive early access to new gadgets, Cluster 1 customers get style‑inspiration newsletters, and Cluster 3 gets flash‑sale alerts. This aligns product affinity with marketing spend. --- ### Takeaway EDA is the bedrock of any data‑science endeavor. It transforms raw tables into actionable stories, uncovers hidden patterns, and safeguards your models from garbage‑in‑garbage‑out pitfalls. Mastering the techniques in this chapter will enable you to navigate any dataset with confidence and craft visual narratives that resonate with stakeholders. ---

Chapter 2: From Raw Signals to Insightful Narratives

Chapter 4: The Art and Science of Exploratory Data Analysis