
Data Science Demystified: A Pragmatic Guide for Business Decision-Makers - Chapter 3


Published 2026-02-23 09:24

# Chapter 3: Exploratory Data Analysis (EDA)

## 3.1 Why EDA Matters in Business

- **Discover hidden patterns** that can turn into actionable insights.
- **Validate assumptions** before building models (e.g., linearity, homoscedasticity).
- **Identify data quality issues** such as outliers, missing values, or skewed distributions.
- **Guide feature engineering** and model selection by revealing variable importance.

> *Business leaders often ask, "What can we learn from this data?" EDA is the tool that turns raw tables into narratives.*

## 3.2 Core Steps of an EDA Pipeline

| Step | Goal | Typical Python Tools |
|------|------|---------------------|
| 1. **Load & Inspect** | Verify schema, data types, and head rows | `pandas.read_csv`, `pandas.DataFrame.info()` |
| 2. **Summarise** | Statistical overview (mean, std, percentiles) | `pandas.DataFrame.describe()`, `scipy.stats` |
| 3. **Visualise** | Understand distributions & relationships | `seaborn`, `matplotlib`, `plotly` |
| 4. **Transform** | Handle outliers, missingness, scaling | `scikit-learn` preprocessing, custom functions |
| 5. **Reduce Dimensionality** | Simplify high-dimensional data | `sklearn.decomposition.PCA`, `sklearn.manifold.TSNE`, `umap-learn` |
| 6. **Document** | Keep reproducible notebooks, version-control scripts | Git, Jupyter Notebooks, VS Code Live Share |

### 3.2.1 Example: Load & Inspect

```python
import pandas as pd

df = pd.read_csv('sales_data.csv')
df.info()          # prints schema and dtypes directly (returns None)
print(df.head())
```

### 3.2.2 Example: Summary Statistics

```python
summary = df.describe().T
print(summary[['mean', 'std', 'min', '25%', '50%', '75%', 'max']])
```

## 3.3 Visualization Strategies

| Category | Purpose | Recommended Plot | Library |
|----------|---------|------------------|---------|
| **Univariate** | Show distribution of a single variable | Histogram, KDE, Box plot | `seaborn`, `matplotlib` |
| **Bivariate** | Explore relationship between two variables | Scatter, Line, Correlation heatmap | `seaborn`, `plotly` |
| **Multivariate** | Visualise interactions in higher dimensions | Parallel Coordinates, Pair Plot | `seaborn`, `plotly.express` |
| **Temporal** | Detect seasonality or trends | Line plot, Rolling stats | `matplotlib`, `pandas.plotting` |

### 3.3.1 Univariate Example

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['revenue'], kde=True, bins=30)
plt.title('Revenue Distribution')
plt.xlabel('Revenue ($)')
plt.ylabel('Frequency')
plt.show()
```

### 3.3.2 Bivariate Example

```python
sns.scatterplot(data=df, x='advertising_spend', y='revenue', hue='region')
plt.title('Revenue vs Advertising Spend by Region')
plt.show()
```

## 3.4 Statistical Summaries & Diagnostic Tests

| Metric | Interpretation | Code Snippet |
|--------|----------------|--------------|
| **Correlation** | Strength & direction of linear relationship | `df.corr(numeric_only=True)` |
| **Skewness / Kurtosis** | Deviation from normality | `scipy.stats.skew(df['col'])` |
| **Shapiro–Wilk Test** | Test normality | `scipy.stats.shapiro(df['col'])` |
| **Kolmogorov–Smirnov Test** | Compare distributions | `scipy.stats.ks_2samp(df['col1'], df['col2'])` |

Example: check normality of the `age` column.

```python
from scipy import stats

stat, p = stats.shapiro(df['age'])
print(f'Shapiro-Wilk p-value: {p:.4f}')
if p < 0.05:
    print('Data likely not normal.')
else:
    print('Cannot reject normality.')
```

## 3.5 Dimensionality Reduction Techniques

| Technique | When to Use | Typical Implementation |
|-----------|-------------|------------------------|
| **PCA** | Linear relationships, large feature sets | `sklearn.decomposition.PCA` |
| **t-SNE** | Non-linear structure, visualising clusters | `sklearn.manifold.TSNE` |
| **UMAP** | Preserves more global structure, faster than t-SNE | `umap-learn` |

### 3.5.1 PCA Example

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = df[['feature1', 'feature2', 'feature3', 'feature4']]
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled)

plt.figure(figsize=(8, 6))
plt.scatter(principal_components[:, 0], principal_components[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Product Features')
plt.show()
```

## 3.6 EDA in the Context of Model Selection

| Data Insight | Model Implication |
|--------------|-------------------|
| **Linear relationship** | Consider linear regression, Lasso, Ridge |
| **Strong multicollinearity** | Use regularisation or dimensionality reduction |
| **Non-linear patterns** | Tree-based methods (Random Forest, XGBoost) or neural nets |
| **Class imbalance** | Imbalanced-classification techniques (SMOTE, class weights) |

**Case Study – Retail Forecasting**

1. **EDA found** a monthly seasonal pattern and a weak linear trend in sales.
2. **Chosen model**: SARIMA (Seasonal ARIMA), because it captures both trend and seasonality.
3. **Result**: a 12% reduction in forecast error versus a naïve rolling-average model.
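The multicollinearity check from the table above can be done directly during EDA. The sketch below uses a small synthetic dataset (the column names are illustrative, not from the chapter's sales data) to flag feature pairs whose absolute correlation exceeds a threshold, a common signal to reach for regularisation or PCA:

```python
# Sketch: flag strongly correlated feature pairs before model selection.
# All data here is synthetic; in practice, run this on your own DataFrame.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
spend = rng.normal(100, 20, n)
df = pd.DataFrame({
    'advertising_spend': spend,
    'promo_budget': spend * 0.9 + rng.normal(0, 5, n),  # nearly collinear with spend
    'store_count': rng.integers(5, 50, n),              # independent feature
})

corr = df.corr(numeric_only=True)

# Collect feature pairs whose absolute correlation exceeds a threshold.
threshold = 0.8
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
for a, b, r in pairs:
    print(f'{a} vs {b}: r = {r:.2f} -> consider Ridge/Lasso or PCA')
```

The 0.8 threshold is a rule of thumb, not a hard rule; for a more rigorous diagnosis, variance inflation factors (VIF) account for correlations involving more than two features at once.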
## 3.7 Reproducible EDA Practices

| Practice | Rationale | Tool |
|----------|-----------|------|
| **Notebook version control** | Audit trail of analysis | Git, Jupyter Notebook extensions |
| **Seed management** | Consistent random splits | `numpy.random.seed()`, `random_state` arguments in scikit-learn |
| **Automated reports** | Stakeholder communication | `nbconvert`, `papermill`, `DataDog` dashboards |
| **Data lineage tracking** | Trace insights back to raw files | `great_expectations`, `dbt` |

```bash
# Example: commit the notebook after each EDA session
git add sales_eda.ipynb
git commit -m "Add initial EDA with outlier handling"
```

## 3.8 Interactive and Production-Ready EDA

- **Dashboards**: `Plotly Dash` and `Streamlit` for real-time exploration.
- **Automated EDA tools**: `pandas-profiling` and `sweetviz` generate full reports with one line of code.
- **Integration with ML pipelines**: store EDA artifacts (plots, summary tables) alongside model artifacts in MLflow or DVC.

```python
import sweetviz as sv

report = sv.analyze(df)
report.show_html('eda_report.html')
```

## 3.9 Take-Away Checklist for Business Leaders

- **Ask**: What are the business questions we want to answer? EDA should align with them.
- **Validate**: Are the data patterns stable over time? Re-run EDA periodically.
- **Document**: Keep a versioned record of the EDA notebook and key findings.
- **Communicate**: Translate plots into business-relevant narratives.
- **Iterate**: Use insights to refine data collection and feature engineering.

---

*In the next chapter, we'll translate the insights from this EDA into robust, supervised learning models.*