
Published on 2026-02-27 12:22

# Chapter 3: Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the bridge between cleaned data and the models, dashboards, or reports that ultimately influence business decisions. EDA blends statistical rigor with visual storytelling to reveal the structure, patterns, and anomalies in a dataset before any formal modeling or inference is undertaken.

## 3.1 Why EDA Matters

| Goal | Why it’s important for business | Typical output |
|------|----------------------------------|----------------|
| Detect data quality issues | A single outlier or missing value can skew an entire forecast | Box‑plots, missing‑value heatmaps |
| Identify key drivers | Knowing which variables drive revenue helps focus strategy | Correlation heatmaps, feature importance plots |
| Inform modeling choices | Distribution shapes dictate which statistical tests or algorithms are appropriate | Histogram overlays, QQ‑plots |
| Communicate insights early | Stakeholders often need visual evidence before committing resources | Interactive dashboards, narrative summaries |

### Practical Insight

> *If you can spot a data error early, you save months of modeling time and avoid costly mis‑informed decisions.*

## 3.2 Descriptive Statistics – The First Look

| Statistic | Description | Typical use |
|-----------|-------------|-------------|
| Mean | Arithmetic average; sensitive to outliers | Quick check of central tendency |
| Median | 50th percentile | Central tendency that is robust to outliers |
| Mode | Most frequent value | Categorical data |
| Standard Deviation / Variance | Spread of the data | Gauge volatility |
| Quartiles | 25th, 50th, 75th percentiles | Build box‑plots |
| Skewness / Kurtosis | Shape of the distribution | Detect asymmetry, heavy tails |

```python
import pandas as pd

# `df` is assumed to be an already-loaded DataFrame.
# Quick summary of numeric columns
summary = df.describe().T
# numeric_only=True skips text/categorical columns, which would otherwise raise
summary['skew'] = df.skew(numeric_only=True)
# pandas reports excess kurtosis (a normal distribution scores 0)
summary['kurt'] = df.kurtosis(numeric_only=True)
print(summary)
```

*Tip:* For large datasets, compute these statistics on a stratified sample rather than the full table to keep computation light.
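The tip about stratified subsets can be made concrete with a small sketch. It assumes a hypothetical `region` column to stratify on and uses synthetic data invented for illustration; the helper name `stratified_summary` is not from any library.

```python
import numpy as np
import pandas as pd

def stratified_summary(df, strata_col, frac=0.1, seed=0):
    """Describe numeric columns using a per-stratum random sample."""
    # Draw the same fraction from every stratum so group proportions are preserved
    sample = df.groupby(strata_col).sample(frac=frac, random_state=seed)
    numeric = sample.select_dtypes(include="number")
    summary = numeric.describe().T
    summary["skew"] = numeric.skew()
    summary["kurt"] = numeric.kurtosis()
    return sample, summary

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "region": ["north"] * 600 + ["south"] * 400,
    "sales": rng.lognormal(mean=3.0, sigma=0.5, size=1000),
})

sample, summary = stratified_summary(demo, "region", frac=0.1)
print(sample["region"].value_counts())
print(summary)
```

Sampling within each group, rather than from the whole table, keeps small strata from being under-represented by chance. The sample statistics are estimates, so it is worth re-running with a different `seed` to see how stable they are.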
## 3.3 Univariate Visualizations

| Plot | What it shows | When to use |
|------|---------------|-------------|
| Histogram | Frequency distribution | Continuous variables |
| Kernel Density Estimate (KDE) | Smoothed density | Comparing the shape against a reference (e.g. normal) distribution |
| Bar Chart | Counts of categories | Categorical variables |
| Box‑plot | Quartiles, outliers | Detect extreme values |
| Violin Plot | Density + box‑plot | Multi‑modal data, comparing distributions across groups |

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram with KDE
sns.histplot(df['sales'], kde=True, color='skyblue')
plt.title('Distribution of Sales')
plt.xlabel('Sales (USD)')
plt.ylabel('Count')
plt.show()
```

### Business Example

A retail chain notices a spike in sales on weekends. A histogram of daily sales reveals a secondary mode, with weekend figures roughly 20 % higher than weekday ones, prompting a targeted promotion strategy.

## 3.4 Bivariate Relationships

| Plot | What it shows | Typical correlation measure |
|------|---------------|-----------------------------|
| Scatter Plot | Joint distribution, linearity | Pearson r |
| Joint KDE | Bivariate density | Pearson r or Spearman ρ |
| Heatmap of correlation matrix | Pairwise relationships | Pearson / Spearman |
| Pairplot (Seaborn) | Multiple variable plots | Visual inspection |

```python
# Correlation heatmap (numeric_only=True skips text/categorical columns)
corr = df.corr(method='pearson', numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```

*Practical Insight:* Use *Spearman* for ordinal data or when relationships are monotonic but not linear.

## 3.5 Multivariate Exploration

1. **Principal Component Analysis (PCA)** – Reduce dimensionality while preserving variance.

   ```python
   from sklearn.decomposition import PCA
   from sklearn.preprocessing import StandardScaler

   # PCA is scale-sensitive and cannot handle NaNs:
   # drop incomplete rows and standardize before fitting
   X = StandardScaler().fit_transform(df.select_dtypes(include='number').dropna())
   pca = PCA(n_components=2)
   components = pca.fit_transform(X)
   sns.scatterplot(x=components[:, 0], y=components[:, 1])
   ```

2. **t‑SNE / UMAP** – Visualize high‑dimensional clusters in 2‑D.
3. **Pairwise plots** – Inspect relationships among multiple variables simultaneously.
4. **Heatmap of missingness** – Identify patterns of missing data that may correlate with other variables.

## 3.6 Detecting Outliers and Missing Data

| Technique | How it works | Typical visualization |
|-----------|--------------|-----------------------|
| IQR rule | Flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR | Box‑plot |
| Z‑score | Standard deviations from mean | Scatter with threshold line |
| Mahalanobis distance | Multivariate outlier detection | Heatmap or scatter in PCA space |
| Missing‑value heatmap | Show presence/absence per cell | `missingno.matrix` |

```python
# Z‑score outlier detection
import numpy as np
from scipy import stats

# nan_policy='omit' ignores missing values instead of turning every score into NaN
z_scores = np.abs(stats.zscore(df['sales'], nan_policy='omit'))
outliers = df[z_scores > 3]
print(f'Found {len(outliers)} outliers')
```

**Business Implication** – Outliers can represent fraud, data entry errors, or genuinely rare high‑impact events. Validate before removal.

## 3.7 Time‑Series Specific EDA

| Plot | Insight | Common Techniques |
|------|---------|-------------------|
| Line chart | Trend, seasonality | Decompose with STL |
| Autocorrelation Function (ACF) | Lag dependence | `statsmodels.graphics.tsaplots.plot_acf` |
| Seasonal Subseries Plot | Seasonal pattern per period | Box‑plot per month |
| Lag‑plot | Non‑linear dependencies | Scatter of `x[t]` vs `x[t-1]` |

```python
import statsmodels.api as sm

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()

# Classical additive decomposition (not STL; for STL see statsmodels.tsa.seasonal.STL).
# Needs a regular DatetimeIndex with a set frequency, or an explicit `period=` argument.
decomposition = sm.tsa.seasonal_decompose(df['sales'], model='additive')
decomposition.plot()
plt.show()
```

### Example

A subscription service observes a pronounced quarterly bump. EDA pinpoints that the bump aligns with a marketing campaign launch, enabling fine‑tuning of future outreach.
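The lag-dependence and seasonal-subseries ideas from the table in this section can also be checked numerically with pandas alone. The following is a minimal sketch on a synthetic daily series with a built-in weekly cycle; the data, the lag range of 1 to 10, and the variable names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales with a weekly cycle, invented purely for illustration
rng = np.random.default_rng(0)
dates = pd.date_range("2025-01-01", periods=364, freq="D")
t = np.arange(len(dates))
sales = pd.Series(
    100 + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 2, len(dates)),
    index=dates,
    name="sales",
)

# Autocorrelation at lags 1..10: correlation between x[t] and x[t-lag]
acf = pd.Series({lag: sales.autocorr(lag=lag) for lag in range(1, 11)})
best_lag = int(acf.idxmax())
print(acf.round(2))
print(f"Strongest autocorrelation at lag {best_lag}")

# Seasonal subseries summary: mean sales per day of week (0 = Monday)
by_weekday = sales.groupby(sales.index.dayofweek).mean()
print(by_weekday.round(1))
```

On this synthetic series the strongest lag is 7, which matches the weekly cycle built into the data. On real data, a peak at a particular lag is a hint about seasonality to investigate, not proof of it.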
## 3.8 Practical Example: Retail Sales Dataset

Below is a walk‑through using a fictional dataset (`sales_df`) containing:

- `date` (datetime)
- `store_id` (categorical)
- `product_category` (categorical)
- `sales` (float)
- `promotion` (bool)

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Load
sales_df = pd.read_csv('sales_data.csv', parse_dates=['date'])

# 2. Descriptive stats
print(sales_df.describe(include='all'))

# 3. Univariate plot
sns.histplot(sales_df['sales'], kde=True)
plt.title('Sales Distribution')
plt.show()

# 4. Box‑plot by promotion
sns.boxplot(x='promotion', y='sales', data=sales_df)
plt.title('Sales vs Promotion')
plt.show()

# 5. Correlation heatmap (numeric_only=True skips text/categorical columns)
corr = sales_df.corr(method='pearson', numeric_only=True)
sns.heatmap(corr, annot=True)
plt.title('Correlation Matrix')
plt.show()

# 6. Time‑series trend per store (sum only the sales column, not every column)
store_trend = sales_df.groupby(['date', 'store_id'], as_index=False)['sales'].sum()
plt.figure(figsize=(12, 6))
for store in store_trend['store_id'].unique():
    subset = store_trend[store_trend['store_id'] == store]
    plt.plot(subset['date'], subset['sales'], label=f'Store {store}')
plt.legend()
plt.title('Sales Trend by Store')
plt.show()
```

*Takeaway:* In this fictional dataset, the visual and statistical exploration suggests that promotions are associated with roughly a 15 % lift in sales, while store 5 consistently underperforms relative to the others. EDA alone shows association, not causation, so such findings should be confirmed before acting on them.

## 3.9 Reproducibility and Documentation

| Practice | Why it matters |
|----------|----------------|
| Notebook comments | Explain reasoning behind each plot |
| Use of Jupyter notebooks or R Markdown | Share code & results together |
| Version control (Git) | Track changes to EDA scripts |
| Data dictionaries | Define variable meaning and units |
| Automated EDA pipelines | Run on new data batches without manual tweaking |

A minimal reproducible EDA script is a *first draft* of the analytical artifact that stakeholders can inspect and extend.

## 3.10 Key Take‑aways for Decision Makers

1. **EDA is not a one‑off task** – It should be revisited as new data arrives or business questions evolve.
2. **Visuals translate numbers** – A well‑crafted chart can uncover patterns that raw tables hide.
3. **Context drives interpretation** – Always tie statistical findings back to business goals and domain knowledge.
4. **Documentation is evidence** – Keep a clear record of assumptions, transformations, and observations.
5. **Outliers and missing data require judgment** – Decide whether to correct, impute, or flag them based on their potential impact.

By embedding these practices, analysts empower executives to ask better questions and make evidence‑based decisions.

---

*In the next chapter, we will formalize these observations with statistical modeling basics to translate EDA insights into predictive power.*
