Chapter 4: Exploratory Data Analysis (EDA)

Published on 2026-03-06 20:30

## 4.1 Why EDA Matters

Exploratory Data Analysis is the *first* investigative step after cleaning and before modeling. It turns raw numbers into visual stories and helps you:

- **Discover** hidden patterns, relationships, and distributions.
- **Validate** assumptions (e.g., normality, independence).
- **Spot** anomalies, outliers, and missing‑value regimes.
- **Generate** hypotheses that guide feature engineering and model choice.
- **Communicate** insights to stakeholders with reproducible visuals.

In short, EDA is both a science and an art: you rely on statistical tests, but you also read the *look* of your data.

---

## 4.2 Core Concepts & Terminology

| Term | Definition |
|------|------------|
| **Univariate** | Analysis of a single variable (e.g., histogram of age). |
| **Bivariate** | Joint analysis of two variables (e.g., scatter plot of income vs. spending). |
| **Multivariate** | Analysis involving three or more variables (e.g., pairplot or heatmap). |
| **Skewness** | Measure of asymmetry in a distribution. |
| **Kurtosis** | Measure of tail‑heaviness; high values suggest more extreme observations (potential outliers). |
| **Correlation** | Strength and direction of a linear relationship (Pearson) or a monotonic relationship (Spearman). |
| **Covariance** | Unstandardized measure of joint variability. |
| **Missing Completely At Random (MCAR)** | Probability of missingness is independent of both observed and unobserved data. |
| **Missing At Random (MAR)** | Missingness depends on observed data but not on unobserved data. |
| **Missing Not At Random (MNAR)** | Missingness depends on unobserved data. |

---

## 4.3 Toolset Overview

| Library | Primary Use | Strength |
|---------|-------------|----------|
| **Matplotlib** | Base plotting, fine‑grained control | Mature, highly customizable |
| **Seaborn** | Statistical visualizations built on Matplotlib | Built‑in themes, concise syntax |
| **Plotly** | Interactive, web‑ready plots | Hover info, 3‑D plots, easy sharing |
| **Missingno** | Visualizing missing‑value patterns | Quick heatmap, bar, matrix |
| **Pandas** | Data handling & aggregation | Powerful groupby & descriptive stats |

---

## 4.4 Visual Exploration

### 4.4.1 Univariate Analysis

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('titanic.csv')

# Histogram & KDE (missing ages are dropped before plotting)
sns.histplot(df['Age'].dropna(), kde=True, color='steelblue')
plt.title('Age Distribution')
plt.show()
```

**Takeaway:** Inspect shape, center, and spread. If the histogram is heavily skewed, consider a transformation (for example, a log transform for right‑skewed data).

### 4.4.2 Bivariate Analysis

```python
# Scatter plot with regression line.
# Survived is binary (0/1), so the fitted line is only a rough trend in survival rate.
sns.lmplot(x='Fare', y='Survived', data=df, height=6, aspect=1.2, ci=None)
plt.title('Fare vs. Survival')
plt.show()
```

Use **pairplot** for many variables:

```python
# pairplot draws numeric columns only, so a string column such as 'Sex' would be skipped
sns.pairplot(df[['Pclass', 'Age', 'Fare', 'Survived']], hue='Survived')
plt.show()
```

### 4.4.3 Multivariate Analysis

```python
# Correlation heatmap (numeric columns only; text columns cannot be correlated)
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
```

**Interactive dashboards:** for stakeholder‑facing exploration, consider Plotly Dash or Streamlit.
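
The correlation heatmap above uses Pearson correlation, which only captures linear association. As a minimal sketch of why the Pearson/Spearman distinction from Section 4.2 matters, the snippet below compares the two on synthetic data. The `demo` frame, its columns `x` and `y`, and the exponential relationship are assumptions made purely for illustration; they are not part of the Titanic file.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data (not from titanic.csv): y grows
# monotonically but nonlinearly with x, plus a little noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 5, size=500)
y = np.exp(x) + rng.normal(0, 1, size=500)
demo = pd.DataFrame({'x': x, 'y': y})

# Pearson measures linear association; Spearman works on ranks,
# so it measures monotonic association.
pearson = demo.corr(method='pearson').loc['x', 'y']
spearman = demo.corr(method='spearman').loc['x', 'y']

print(f'Pearson:  {pearson:.3f}')
print(f'Spearman: {spearman:.3f}')
```

On data like this, Spearman should come out noticeably higher than Pearson, because the relationship is strongly monotonic but far from a straight line. When the two coefficients disagree on real data, a scatter plot of the pair is usually worth a look before deciding on a transformation.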

---

## 4.5 Statistical Summaries

| Statistic | How to compute | When to use |
|-----------|----------------|-------------|
| Mean | `df.mean(numeric_only=True)` | Central tendency for symmetric data |
| Median | `df.median(numeric_only=True)` | Central tendency that is robust to outliers |
| Standard Deviation | `df.std(numeric_only=True)` | Dispersion |
| Skewness | `df.skew(numeric_only=True)` | Detect asymmetry |
| Kurtosis | `df.kurtosis(numeric_only=True)` | Heavy‑tailedness (pandas reports excess kurtosis, so a normal distribution scores about 0) |
| Correlation | `df.corr(method='pearson', numeric_only=True)` | Linear relationships |
| Chi‑square | `scipy.stats.chi2_contingency(table)`, with `table` a contingency table such as `pd.crosstab(df['Sex'], df['Survived'])` | Association between two categorical variables |

Example: computing descriptive stats for numerical columns:

```python
# describe() summarizes numeric columns by default; .T puts one column per row
print(df.describe().T[['mean', 'std', 'min', '25%', '50%', '75%', 'max']])
```

---

## 4.6 Missingness & Outliers

### 4.6.1 Visualizing Missingness

```python
import missingno as msno

# Each row of the matrix is a record; white gaps mark missing values
msno.matrix(df)
plt.title('Missingness Matrix')
plt.show()
```

### 4.6.2 Outlier Detection

| Method | Formula | Typical Threshold |
|--------|---------|-------------------|
| IQR | `IQR = Q3 - Q1` | More than 1.5 * IQR above Q3 or below Q1 |
| Z‑score | `z = (x - μ) / σ` | abs(z) > 3 |
| Robust (MAD) | `MAD = median(abs(x - median(x)))` | Outside median ± 1.5 * MAD; robust to extreme values |

Example using IQR:

```python
# Quartiles and interquartile range of the fare
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1

# Fences at 1.5 * IQR beyond the quartiles
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df['Fare'] < lower) | (df['Fare'] > upper)]
print('Number of outliers:', outliers.shape[0])
```

---

## 4.7 Time‑Series Specific EDA

1. **Decompose** the series into trend, seasonality, and residuals (statsmodels `seasonal_decompose`).
2. **Autocorrelation** (pandas `autocorr`, statsmodels `plot_acf`).
3. **Stationarity** tests (e.g., the Augmented Dickey‑Fuller test, statsmodels `adfuller`).
4. **Seasonal Subseries Plot**.
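
As a rough illustration of items 2 and 3 in the list above, the sketch below uses pandas `Series.autocorr` to compare a non‑stationary series with its first difference. The data are synthetic (a random walk generated on the spot), and the names `walk`, `diffed`, and `acf_table` are hypothetical. This is an informal check only, not a substitute for a formal stationarity test such as ADF.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (an assumption for illustration, not real sales data):
# a random walk, which is non-stationary by construction.
rng = np.random.default_rng(0)
dates = pd.date_range('2024-01-01', periods=300, freq='D')
walk = pd.Series(rng.normal(size=300).cumsum(), index=dates)

# First differencing recovers the underlying white-noise steps.
diffed = walk.diff().dropna()

# Autocorrelation at a few lags for both series.
lags = [1, 7, 30]
acf_table = pd.DataFrame(
    {
        'random_walk': [walk.autocorr(lag=k) for k in lags],
        'differenced': [diffed.autocorr(lag=k) for k in lags],
    },
    index=pd.Index(lags, name='lag'),
)

print(acf_table.round(3))
```

Autocorrelation that stays high and decays slowly, as in the `random_walk` column, is a common hint of non‑stationarity. Values near zero after differencing suggest that one round of differencing may be enough, which a formal test can then confirm.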

Example of a decomposition:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# df here is a sales table with 'Date' and 'Sales' columns, not the Titanic data
ts = df.set_index('Date')['Sales']
# Multiplicative models need strictly positive values; period=12 assumes monthly data
decomp = sm.tsa.seasonal_decompose(ts, model='multiplicative', period=12)
decomp.plot()
plt.show()
```

---

## 4.8 Reproducible EDA Workflow

| Step | Implementation | Notes |
|------|----------------|-------|
| 1 | Fix the random seed before any sampling or shuffling | `np.random.seed(42)` |
| 2 | Version‑control notebooks | Git + Data Version Control (DVC) |
| 3 | Store plots in a dedicated folder | `figures/` |
| 4 | Automate summary stats | `df.describe().to_csv('summary.csv')` |
| 5 | Use `ipywidgets` for interactive filters | Great for demos |
| 6 | Document assumptions | Inline Markdown cells |

---

## 4.9 Common Pitfalls & How to Avoid Them

| Pitfall | Why it’s problematic | Remedy |
|---------|----------------------|--------|
| **Correlation ≠ Causation** | Mistaking a spurious relationship for a causal link | Check domain knowledge, control for confounding variables |
| **Data Leakage during EDA** | Using information from held‑out or future data (e.g., a mean computed over the entire dataset) before training | Split data first, compute stats on the training set only |
| **Over‑fitting to Visuals** | Tailoring features to specific plots, ignoring broader context | Validate patterns with statistical tests |
| **Ignoring Context** | Treating numbers as isolated | Keep the business problem at the core |
| **Neglecting Missingness Mechanism** | Imputing blindly without understanding MCAR/MAR/MNAR | Perform missingness tests, report assumptions |

---

## 4.10 Summary & Key Takeaways

1. **EDA is exploratory, not prescriptive.** Use it to guide decisions, not to decide everything.
2. **Visuals + stats** – Combine plots with descriptive statistics for robust insights.
3. **Document everything** – Reproducibility is the backbone of a trustworthy data science project.
4. **Respect data types** – Tailor techniques to numerical, categorical, datetime, or text data.
5. **Keep an eye on ethics** – Visual misrepresentations can mislead stakeholders and propagate bias.
6. **Iterate** – EDA is a loop: refine plots, discover new variables, revisit hypotheses.

> *“A well‑crafted EDA is the compass that points data scientists toward the most promising directions in the data‑land.”* – 墨羽行

---

## 4.11 Further Reading & Resources

| Resource | Focus |
|----------|-------|
| *Python Data Science Handbook* – Jake VanderPlas | Practical code examples |
| *ggplot2: Elegant Graphics for Data Analysis* – Hadley Wickham | Theoretical foundations for visual encoding |
| *An Introduction to Statistical Learning* – Gareth James et al. | Correlation, hypothesis testing |
| Plotly Docs | Interactive plot features |
| Missingno Docs | Visualizing missing data |
| Statsmodels API | Time‑series decomposition |

---

## 4.12 Hands‑On Exercise

1. Download the **Adult Income** dataset from UCI.
2. Perform the following EDA steps:
   - Univariate plots for each numeric feature.
   - Correlation heatmap; identify the numeric feature most strongly correlated with income (encode the income label as 0/1 first).
   - Visualize missingness patterns (missing values in this dataset are coded as `?`, so convert them to NaN first).
   - Detect outliers in *Hours-per-week*.
3. Compile a 2‑page report with plots, tables, and a brief narrative.
4. Share your Jupyter notebook on GitHub, including a `requirements.txt`.

Good luck, and remember: *The goal of EDA is to ask the right questions, not just to show charts.*
