Chapter 3: Exploratory Data Analysis (EDA)

Published on 2026-02-24 12:29

> **Key Insight:** *Exploratory Data Analysis is the laboratory where raw data is turned into hypotheses.*

---

## 1. What is EDA?

Exploratory Data Analysis (EDA) is a systematic, iterative process of probing a dataset to uncover its underlying structure, detect anomalies, and formulate preliminary hypotheses. Unlike confirmatory statistics, which tests hypotheses fixed in advance, EDA is open-ended: the goal is to discover and explore.

| EDA Element | Purpose | Typical Techniques |
|-------------|---------|---------------------|
| Descriptive statistics | Summarize central tendency and dispersion | mean, median, mode, std, IQR |
| Data visualization | Reveal patterns, relationships, and outliers | histograms, boxplots, scatter plots, heatmaps |
| Data quality checks | Identify missingness, duplicates, and inconsistencies | missing-value heatmap, duplicate row count |
| Feature engineering | Transform raw variables into analytic features | scaling, encoding, interaction terms |

---

## 2. The EDA Workflow

1. **Define the Problem Context** – Clarify the business questions.
2. **Load and Inspect the Data** – Take a quick look at dimensions, dtypes, and sample rows.
3. **Summarize Univariate Distributions** – Compute statistics and plot histograms.
4. **Explore Bivariate Relationships** – Use scatter, bar, or violin plots.
5. **Detect and Treat Anomalies** – Handle outliers and missing values.
6. **Generate Feature Insights** – Build correlation matrices and compute mutual information.
7. **Document Findings** – Record them as narrative, markdown, or notebooks.

> *Tip:* Keep the workflow modular: write reusable functions or use Jupyter notebooks for transparency.

---

## 3. Hands-On Example: Retail Sales Dataset

We'll walk through a real-world dataset, **`sales.csv`**, which contains daily sales for 12 product categories over 2 years.
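
If you do not have `sales.csv` at hand, a small synthetic table with the same visible columns is enough to try the steps in this section. The sketch below is an assumption made for illustration only: the helper name `make_synthetic_sales`, the three categories, and every value it generates are made up, and it does not reproduce the figures quoted in this chapter.

```python
import numpy as np
import pandas as pd


def make_synthetic_sales(n_days=60, seed=0):
    """Build a small, made-up sales table (hypothetical helper, not the real data)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2021-01-01", periods=n_days, freq="D")
    categories = ["Food", "Electronics", "Clothing"]  # assumed subset of categories
    rows = []
    for i, date in enumerate(dates):
        price = round(float(rng.uniform(2.0, 50.0)), 2)
        units = int(rng.integers(1, 100))      # units sold that day, 1..99
        promo = int(rng.random() < 0.2)        # roughly 20% of days flagged as promo
        rows.append({
            "date": date,
            "product_id": 101 + i % len(categories),
            "category": categories[i % len(categories)],
            "store_id": 1,
            "sales": round(price * units, 2),  # revenue = price * units
            "price": price,
            "units": units,
            "promo": promo,
        })
    return pd.DataFrame(rows)


sales_demo = make_synthetic_sales()
print(sales_demo.shape)
print(sales_demo.dtypes)
```

To reuse it with the code in this section, you could write it out with `sales_demo.to_csv('sales.csv', index=False)`; expect different shapes and statistics from those reported for the real file.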
```python
# 1. Load the data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# parse_dates turns the 'date' column into datetime64 values
sales = pd.read_csv('sales.csv', parse_dates=['date'])
print(sales.shape)  # (731, 14)
```

### 3.1 Data Inspection

```python
sales.head()
```

| date | product_id | category | store_id | sales | price | units | promo | ... |
|------------|------------|----------|----------|-------|-------|-------|-------|-----|
| 2021-01-01 | 101 | Food | 1 | 250 | 5.00 | 50 | 0 | ... |

- **Nulls**

```python
# Count missing values per column
sales.isna().sum()
```

```text
sales    0
price    0
units    0
promo    0
...
```

- **Duplicates**

```python
# Count fully duplicated rows
sales.duplicated().sum()
```

```text
0
```

### 3.2 Univariate Analysis

#### 3.2.1 Summary Statistics

```python
# include='all' also summarizes non-numeric columns (count, unique, top, freq)
sales.describe(include='all')
```

Key insights:

- Mean sales per day: **$2,500**
- Standard deviation: **$1,200**
- 75th percentile of units sold: **75**

#### 3.2.2 Histograms & Density Plots

```python
plt.figure(figsize=(12, 4))
sns.histplot(sales['sales'], kde=True, bins=30)
plt.title('Distribution of Daily Sales')
plt.xlabel('Sales ($)')
plt.ylabel('Frequency')
plt.show()
```

The histogram reveals a **right-skewed** distribution, which is typical for revenue data.

### 3.3 Bivariate Analysis

#### 3.3.1 Price vs. Units Sold

```python
plt.figure(figsize=(6, 6))
sns.scatterplot(data=sales, x='price', y='units', hue='category')
plt.title('Price vs. Units Sold by Category')
plt.show()
```

Observation: *Electronics* shows a weak negative correlation; *Food* displays a strong positive correlation.

#### 3.3.2 Sales Over Time

```python
sales.set_index('date')['sales'].plot(figsize=(14, 4))
plt.title('Daily Sales Trend')
plt.ylabel('Sales ($)')
plt.show()
```

Sales peak in December, which points to a seasonal pattern and suggests holiday promotion opportunities.

### 3.4 Outlier Detection & Treatment

```python
# Boxplot for sales
sns.boxplot(x=sales['sales'])
plt.title('Boxplot of Sales')
plt.show()
```

Outliers: sales exceed $6,000 on a few days.
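
The points a boxplot draws beyond its whiskers can also be listed numerically. The sketch below applies the conventional 1.5 × IQR rule (the default whisker length in Matplotlib and seaborn boxplots) to a short, made-up series; the helper name `iqr_outlier_mask` and the sample values are assumptions for illustration, not part of the sales dataset.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up daily sales figures, for illustration only
demo_sales = pd.Series([2100, 2300, 2500, 2400, 2600, 2200, 2700, 9000])
mask = iqr_outlier_mask(demo_sales)
print(demo_sales[mask])  # only the 9000 entry is flagged
```

On the real data the same call would be `iqr_outlier_mask(sales['sales'])`. Listing the flagged rows alongside their dates makes it easier to judge whether a spike is a data entry error or a genuine event such as a holiday.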
Options:

- **Winsorize** (cap values at the 95th percentile)
- **Remove** if justified (e.g., data entry error)

```python
# Cap the upper tail at the 95th percentile; the lower tail is left unchanged
sales['sales_cap'] = sales['sales'].clip(upper=sales['sales'].quantile(0.95))
```

### 3.5 Correlation & Heatmaps

```python
numeric_cols = ['sales', 'price', 'units', 'promo']
corr = sales[numeric_cols].corr()  # Pearson correlation by default
# Fix the color scale to the full correlation range so colors are comparable
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
```

The positive correlation between *promo* and *sales* suggests that promotions are associated with higher revenue. Correlation alone does not show that promotions cause the increase, so treat this as a hypothesis to test.

### 3.6 Feature Engineering Ideas

| Feature | Rationale |
|---------|-----------|
| **sales_per_unit** | `sales / units`, the average realized price per unit; a starting point for price-elasticity analysis |
| **month** | Extract month for seasonal analysis |
| **is_holiday** | Flag holidays to study promotion impact |

```python
sales['month'] = sales['date'].dt.month
# Treat zero-unit days as missing so the ratio is NaN rather than inf
sales['sales_per_unit'] = sales['sales'] / sales['units'].replace(0, float('nan'))
```

---

## 4. Best Practices for EDA

| Practice | Why it Matters |
|----------|----------------|
| **Use Reproducible Scripts** | Enables audit trails and collaboration |
| **Document Every Decision** | Clarifies assumptions when building models |
| **Iterate Quickly** | EDA is exploratory; rapid iterations reveal patterns faster |
| **Visualize with Context** | Include titles, axis labels, and legends to avoid misinterpretation |
| **Check Data Types Early** | Prevents errors in calculations (e.g., numeric fields stored as strings) |
| **Integrate Domain Knowledge** | Guides which relationships to probe |

---

## 5. Common Pitfalls and How to Avoid Them

| Pitfall | Consequence | Mitigation |
|----------|-------------|------------|
| **Over-fitting to Visual Noise** | Misguided hypotheses | Use statistical tests to confirm visual patterns |
| **Ignoring Missing Data** | Biased estimates | Apply imputation or flag missingness |
| **Skipping Data Validation** | Garbage in, garbage out | Cross-check against source systems |
| **Relying on a Single Metric** | Oversimplification | Combine multiple summary statistics |
| **Neglecting Time-Series Properties** | Misinterpretation of trends | Plot with a time axis, check for autocorrelation |

---

## 6. Tools & Libraries

| Library | Strengths |
|---------|-----------|
| **pandas** | Data manipulation & aggregation |
| **NumPy** | Fast numerical operations |
| **Matplotlib** | Customizable plots |
| **Seaborn** | Statistical visualizations & aesthetics |
| **Plotly** | Interactive plots for dashboards |
| **Sweetviz / ydata-profiling (formerly pandas-profiling)** | Auto-generated EDA reports |
| **Great Expectations** | Data validation & quality checks |

---

## 7. Take-Away Checklist

- [ ] Verify data types and missing values.
- [ ] Compute univariate statistics and visualize distributions.
- [ ] Explore pairwise relationships with scatter plots or correlation matrices.
- [ ] Identify outliers and decide on treatment.
- [ ] Generate new features aligned with business questions.
- [ ] Document insights in a narrative report.
- [ ] Review findings with domain experts before modeling.

---

## 8. Real-World Case Study: Customer Retention for a SaaS Platform

**Context**: A SaaS company wants to predict churn. The dataset contains user activity logs, subscription details, and support tickets.

1. **Univariate**: Mean days-active per month is high, and the skew indicates a small group of highly active users.
2. **Bivariate**: Support tickets and retention show a strong negative correlation.
3. **Time-Series**: Activity drops during the first 30 days after signup.
4. **Outlier Handling**: Exclude users whose anomalously high ticket counts were caused by system bugs.
5. **Feature Engineering**: Create `avg_session_duration`, `support_ticket_count`, and `days_since_last_login`.
6. **Result**: EDA guided the choice of a tree-based model and highlighted the importance of early engagement metrics.

---

## 9. Next Steps

With a solid EDA foundation, you're ready to move into **Statistical Inference** (Chapter 4) and **Predictive Modeling** (Chapter 5). The insights you've captured will shape feature selection, model choice, and evaluation strategies.

---

> *Remember:* *EDA is not a one-off step; revisit it whenever you acquire new data or pivot the business question.*
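
Revisiting EDA is cheaper when the first checklist items are wrapped in a function that can be re-run on each new extract. The following is a minimal sketch under stated assumptions: the function name `eda_summary`, the choice of reported fields, and the tiny demo table are all hypothetical, and the outlier count uses the 1.5 × IQR rule rather than any method prescribed by this chapter.

```python
import numpy as np
import pandas as pd


def eda_summary(df):
    """Per-column dtype, missing share, cardinality, and IQR outlier count."""
    records = []
    for name in df.columns:
        col = df[name]
        record = {
            "column": name,
            "dtype": str(col.dtype),
            "missing_pct": round(100 * col.isna().mean(), 1),
            "n_unique": col.nunique(),   # NaN is not counted as a value
            "n_outliers": np.nan,        # stays NaN for non-numeric columns
        }
        if pd.api.types.is_numeric_dtype(col) and not pd.api.types.is_bool_dtype(col):
            q1, q3 = col.quantile(0.25), col.quantile(0.75)
            iqr = q3 - q1
            outside = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
            record["n_outliers"] = int(outside.sum())
        records.append(record)
    return pd.DataFrame(records).set_index("column")


# Made-up table, for illustration only
demo = pd.DataFrame({
    "sales": [250.0, 300.0, 280.0, 5000.0, None],
    "category": ["Food", "Food", "Toys", "Toys", "Food"],
})
report = eda_summary(demo)
print(report)
print("duplicate rows:", int(demo.duplicated().sum()))
```

Keeping the output as a DataFrame means successive reports can be compared or saved next to the notebook, which supports the documentation step of the workflow.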
