
Published on 2026-02-23 16:15

# Chapter 4: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the **first analytical step** that turns raw data into insights. It combines statistical summaries, visualizations, and interactive tools to uncover patterns, spot anomalies, test assumptions, and formulate hypotheses for downstream modeling.

---

## 4.1 Why EDA Matters

| Benefit | Description |
|---|---|
| **Rapid Insight** | Discover trends before committing to heavy modeling. |
| **Data Quality Check** | Spot missing values, outliers, and inconsistencies early. |
| **Feature Engineering** | Identify promising variables and relationships. |
| **Model Bias Prevention** | Detect skewed distributions that can bias models. |
| **Storytelling** | Build visual narratives that communicate findings to stakeholders. |

EDA bridges the gap between data ingestion and predictive analytics, ensuring that the data you feed into models is **clean, representative, and understood**.

---

## 4.2 Foundations: Statistical Summaries

Statistical descriptors are the backbone of any EDA session. They provide a quick snapshot of central tendency, dispersion, and shape.
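As a minimal sketch of these three families of descriptors, the snippet below computes them by hand on a small series. The values and the name `spend` are made up purely for illustration; the single extreme value is there to show how the mean and the skew react while the median and IQR stay put.

```python
import pandas as pd

# Hypothetical values, invented for illustration only
values = pd.Series([1, 2, 2, 3, 4, 5, 100], name="spend")

summary = {
    # Central tendency
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().iloc[0],
    # Dispersion
    "std": values.std(),
    "iqr": values.quantile(0.75) - values.quantile(0.25),
    # Shape
    "skew": values.skew(),
    "kurtosis": values.kurt(),
}

for name, value in summary.items():
    print(f"{name:>8}: {float(value):.2f}")
```

The gap between the mean and the median, together with a positive skew, is often the first hint that a column has a long right tail.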
```python
import pandas as pd

# Load sample dataset
df = pd.read_csv('data/sample.csv')

# Summary statistics for numeric columns
print(df.describe(include='number'))

# Count of unique values for categorical (non-numeric) columns
print(df.select_dtypes(exclude='number').nunique())
```

Key metrics:

- **Mean, Median, Mode** – center
- **Std, IQR, Min/Max** – spread
- **Skew, Kurtosis** – distribution shape
- **Correlation matrix** – linear relationships

### Visual Complement: Box Plots & Histograms

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Box plots of the numeric columns on a shared axis
sns.boxplot(data=df.select_dtypes(include='number'), orient='h')
plt.title('Box Plot of Numeric Features')
plt.show()

# One histogram per numeric column
df.hist(bins=30)
plt.show()
```

---

## 4.3 Univariate Analysis

### 4.3.1 Numeric Variables

| Plot | When to Use | Tool(s) |
|---|---|---|
| Histogram | Distribution shape | `matplotlib`, `seaborn`, `plotly` |
| Density Plot | Smooth distribution | `seaborn.kdeplot`, `plotly.express.histogram` (with `histnorm='probability density'`) |
| Box Plot | Outlier detection | `seaborn.boxplot`, `plotly.express.box` |

```python
# Histogram with a KDE overlay
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()
```

### 4.3.2 Categorical Variables

| Plot | Purpose | Tool(s) |
|---|---|---|
| Bar Chart | Frequency counts | `seaborn.countplot`, `plotly.express.bar` |
| Pie Chart | Proportion view | `matplotlib.pyplot.pie` |
| Heatmap (count matrix) | Cross-tab of two categoricals | `seaborn.heatmap` |

```python
# Bar chart for a categorical variable
sns.countplot(x='country', data=df)
plt.title('Customer Distribution by Country')
plt.show()
```

---

## 4.4 Bivariate & Multivariate Analysis

Understanding pairwise relationships sets the stage for feature engineering.
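Pairwise relationships can also be checked numerically before any plotting. The sketch below is illustrative only: the DataFrame is made up, and its column names simply mirror the ones used in this chapter's examples. It covers the three common pairings with standard pandas calls (`pd.crosstab`, `Series.corr`, and `groupby`).

```python
import pandas as pd

# Hypothetical data; column names follow the chapter's examples
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B", "C"],
    "country": ["US", "DE", "US", "US", "DE", "DE"],
    "income": [40, 55, 70, 85, 90, 120],
    "spend": [4, 6, 7, 9, 9, 13],
})

# Categorical vs. categorical: contingency table of counts
counts = pd.crosstab(df["segment"], df["country"])
print(counts)

# Numeric vs. numeric: Pearson correlation coefficient
r = df["income"].corr(df["spend"])
print(f"income/spend correlation: {r:.2f}")

# Categorical vs. numeric: per-group summary
by_segment = df.groupby("segment")["income"].agg(["mean", "median", "count"])
print(by_segment)
```

A count matrix like `counts` is also the kind of input that can be passed to `seaborn.heatmap` for the cross-tab view listed in Section 4.3.2.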
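### 4.4.1 Scatter Plots & Pair Plots

```python
# Pair plot for selected features; the hue column must be part of the subset
sns.pairplot(df[['age', 'income', 'spend', 'segment']], hue='segment')
plt.show()
```

### 4.4.2 Correlation Heatmaps

```python
# Restrict to numeric columns; non-numeric columns raise an error in pandas 2.x
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
```

### 4.4.3 Categorical-Numeric Relationships

```python
sns.boxplot(x='segment', y='income', data=df)
plt.title('Income by Customer Segment')
plt.show()
```

---

## 4.5 Detecting Anomalies & Outliers

| Technique | How it Works | Typical Use-Case |
|---|---|---|
| IQR Method | Flags values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR | Detect extreme purchases |
| Z-Score | Measures distance from the mean in standard deviations | Identify abnormal temperatures |
| Isolation Forest | Isolates points with random partitioning trees; anomalies need fewer splits | Fraud detection |

```python
from scipy import stats

# Z-score outlier detection; nan_policy='omit' ignores missing values
df['zscore'] = stats.zscore(df['spend'], nan_policy='omit')
outliers = df[df['zscore'].abs() > 3]
print(outliers.head())
```

---

## 4.6 Feature Importance & Dimensionality

While EDA focuses on data *before* modeling, it can hint at feature relevance. The thresholds below are common rules of thumb, not hard limits:

- **Correlation thresholds**: Consider dropping features whose correlation with the target is |r| < 0.1, keeping in mind that a weak linear correlation does not rule out a nonlinear relationship
- **Variance threshold**: Remove near-constant variables
- **Multicollinearity**: A Variance Inflation Factor (VIF) above 5 signals redundancy

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Numeric features only, without missing values or the helper column from Section 4.5
X = df.select_dtypes(include='number').drop(columns='zscore', errors='ignore').dropna()
# Add an intercept so each VIF comes from a regression with a constant term
X_const = sm.add_constant(X)

vif = pd.DataFrame()
vif['feature'] = X.columns
# Column 0 of X_const is the constant, so feature i sits at position i + 1
vif['VIF'] = [variance_inflation_factor(X_const.values, i + 1) for i in range(X.shape[1])]
print(vif.sort_values('VIF', ascending=False))
```

---

## 4.7 Automating EDA Workflows

For large, recurring datasets, manual EDA becomes impractical.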
Automate with:

- **ydata-profiling** (formerly pandas-profiling) or **Sweetviz** for auto-generated reports
- **tabulate** for CLI-friendly tables
- **Plotly Dash** or **Streamlit** for interactive dashboards

```python
# Quick profile report (the pandas-profiling package was renamed to ydata-profiling)
from ydata_profiling import ProfileReport

report = ProfileReport(df, title='EDA Report')
report.to_file('report.html')
```

---

## 4.8 Interactive Dashboards

Stakeholders often prefer interactive exploration over static plots. Two popular frameworks:

| Framework | Strengths | Typical Stack |
|---|---|---|
| **Plotly Dash** | Production-ready, supports callbacks | Python, Flask, Docker |
| **Streamlit** | Rapid prototyping, simple API | Python, pip, `streamlit run app.py` |

### Sample Streamlit Dashboard

```python
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt

st.title('Customer Spend Dashboard')
df = pd.read_csv('data/sample.csv')

# Sidebar filter: all segments are selected by default
segments = df['segment'].dropna().unique()
selected = st.sidebar.multiselect('Segment', segments, default=segments)
filtered = df[df['segment'].isin(selected)]

# Plot the spend distribution for the selected segments
fig, ax = plt.subplots()
ax.hist(filtered['spend'].dropna(), bins=30, color='steelblue')
ax.set_title('Spend Distribution')
st.pyplot(fig)
```

---

## 4.9 Best Practices Checklist

| Practice | Why It Matters |
|---|---|
| **Document assumptions** | Ensures reproducibility |
| **Keep a versioned data notebook** | Tracks changes over time |
| **Use consistent color palettes** | Enhances readability |
| **Validate with domain experts** | Aligns findings with business context |
| **Automate routine plots** | Saves time and reduces errors |
| **Publish dashboards to a secure gateway** | Protects sensitive data |

---

## 4.10 Case Study: Retail Sales Forecasting

**Scenario**: A mid-size retailer wants to understand seasonal sales patterns and customer segmentation before building a forecasting model.

1. **Load data** – Sales transactions, customer profiles, and product metadata.
2. **Univariate plots** – Histogram of daily sales, KDE of customer tenure.
3. **Time-series decomposition** – Seasonal component via `statsmodels.tsa.seasonal.seasonal_decompose`.
4. **Correlation heatmap** – Identify which product attributes are most strongly associated with sales.
5. **Cluster customers** – K-means on purchasing behavior (to be used later in segmentation).
6. **Dashboard** – Interactive timeline, heatmap, and cluster summary using Dash.
7. **Insights** – Sales peak in Q4, and high-value customers cluster around premium products.

These insights directly informed the feature set for the final forecasting model and guided marketing strategy.

---

## 4.11 Key Takeaways

- **EDA is exploratory, not prescriptive** – It informs decisions; it does not make them.
- **Visualization is power** – Good plots can uncover hidden patterns faster than tables.
- **Interactivity accelerates insight** – Dashboards allow stakeholders to drill down and validate findings.
- **Automation safeguards repeatability** – Use libraries like `ydata-profiling` and notebooks to lock in EDA steps.
- **Domain knowledge amplifies EDA** – Combine statistical signals with business context for actionable recommendations.

---

**Next Step**: With a solid EDA foundation, we move to Chapter 5 to apply supervised learning techniques and refine predictive models based on the insights uncovered.
