
Data Science for the Analytical Mind: From Raw Data to Insightful Decisions - Chapter 4


Published 2026-03-03 15:05

# Chapter 4: Exploratory Data Analysis & Visualization

Exploratory Data Analysis (EDA) is the compass that guides data scientists from raw data to meaningful insights. In this chapter we:

- **Define** the core concepts and why EDA matters.
- **Walk through** step‑by‑step techniques using Python (pandas, matplotlib, seaborn, plotly).
- **Showcase** real‑world examples and pitfalls.
- **Provide** actionable best‑practice guidelines that can be applied immediately.

---

## 4.1 Why EDA Is Essential

| Dimension | Purpose | Typical Questions |
|-----------|---------|-------------------|
| **Understanding the data** | Discover what data you actually have | How many records? Which columns are missing? |
| **Identifying patterns** | Spot relationships and trends | Are sales increasing over time? |
| **Uncovering anomalies** | Detect outliers or data entry errors | Why does this transaction have an unusually high value? |
| **Hypothesis generation** | Formulate testable assumptions | Does higher advertising spend correlate with more conversions? |

EDA is not just a statistical exercise; it is the *first conversation* between you and the dataset. A well‑executed EDA prevents wasted effort later in the pipeline and builds a strong foundation for modeling and communication.

---

## 4.2 Core Statistical Summaries

### 4.2.1 Univariate Statistics

Use **pandas** to compute descriptive statistics:

```python
import pandas as pd

df = pd.read_csv('sales.csv')
print(df.describe())
```

The `describe()` method provides:

- Count, mean, standard deviation, min, max
- 25th, 50th, and 75th percentiles (quartiles)

> **Tip**: For categorical columns, use `value_counts()` and `unique()` to understand the distribution.

### 4.2.2 Missing Data Patterns

Visualize missingness with a heatmap or bar chart.
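Before plotting, it can help to rank columns by their missing-value counts. A minimal sketch, using a small hypothetical frame in place of `sales.csv` (the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df = pd.read_csv('sales.csv')
df = pd.DataFrame({
    'revenue': [100.0, np.nan, 250.0, 80.0],
    'region': ['North', 'South', None, 'North'],
    'month': ['Jan', 'Jan', 'Feb', 'Feb'],
})

# Count missing values per column, worst columns first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```

The sorted counts give a quick triage list before you decide whether to impute, drop, or investigate each column.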
```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
plt.title('Missing Data Heatmap')
plt.show()
```

The heatmap immediately tells you which columns have gaps and whether missingness is random.

### 4.2.3 Outlier Detection

Outliers can distort model training. A common rule flags values more than 1.5 × IQR beyond the quartiles:

```python
q1 = df['revenue'].quantile(0.25)
q3 = df['revenue'].quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr

outliers = df[(df['revenue'] > upper) | (df['revenue'] < lower)]
print('Number of outliers:', outliers.shape[0])
```

Visual tools like boxplots also help:

```python
sns.boxplot(x=df['revenue'])
plt.title('Revenue Boxplot')
plt.show()
```

---

## 4.3 Distribution Visualizations

### 4.3.1 Histograms & Density Plots

```python
sns.histplot(df['revenue'], kde=True, bins=30)
plt.title('Revenue Distribution')
plt.xlabel('Revenue')
plt.ylabel('Frequency')
plt.show()
```

Use a KDE (Kernel Density Estimate) to smooth the distribution.

### 4.3.2 Empirical Cumulative Distribution Function (ECDF)

```python
import statsmodels.api as sm

# ECDF returns a step function with .x and .y arrays
ecdf = sm.distributions.ECDF(df['revenue'])

fig, ax = plt.subplots()
ax.step(ecdf.x, ecdf.y, where='post')
ax.set_title('Revenue ECDF')
plt.show()
```

ECDFs are useful when comparing distributions across groups.

---

## 4.4 Bivariate Analysis

### 4.4.1 Scatter Plots

Identify linear or non‑linear relationships.

```python
sns.scatterplot(x='price', y='units_sold', data=df)
plt.title('Price vs Units Sold')
plt.show()
```

### 4.4.2 Correlation Matrix

```python
corr = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```

A correlation matrix highlights potential multicollinearity and informs feature selection.

### 4.4.3 Pair Plots

```python
sns.pairplot(df[['price', 'units_sold', 'revenue']])
plt.show()
```

Pair plots give a quick visual of pairwise relationships.
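As noted in Section 4.3.2, ECDFs shine when comparing distributions across groups. A minimal sketch with synthetic revenue samples for two hypothetical regions (stand-ins for `df.groupby('region')['revenue']`):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, right-skewed revenue samples for two illustrative regions
rng = np.random.default_rng(42)
revenue_north = rng.lognormal(mean=3.0, sigma=0.5, size=500)
revenue_south = rng.lognormal(mean=3.3, sigma=0.5, size=500)

def ecdf(values):
    """Return sorted values and their cumulative proportions."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

fig, ax = plt.subplots()
for label, sample in [('North', revenue_north), ('South', revenue_south)]:
    x, y = ecdf(sample)
    ax.step(x, y, where='post', label=label)
ax.set_xlabel('Revenue')
ax.set_ylabel('Cumulative proportion')
ax.set_title('Revenue ECDF by Region')
ax.legend()
```

Overlaid step curves make it easy to see which group's distribution is shifted, without the binning choices a histogram forces on you.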
---

## 4.5 Multivariate Techniques

### 4.5.1 Principal Component Analysis (PCA)

Reduce dimensionality while preserving variance.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = df.select_dtypes(include=['float64', 'int64'])
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled)

plt.scatter(principal_components[:, 0], principal_components[:, 1])
plt.title('PCA 2‑D Projection')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```

### 4.5.2 t‑SNE and UMAP

For highly non‑linear relationships.

```python
import umap

umap_reducer = umap.UMAP(n_components=2, random_state=42)
embedding = umap_reducer.fit_transform(scaled)

plt.scatter(embedding[:, 0], embedding[:, 1])
plt.title('UMAP Projection')
plt.show()
```

---

## 4.6 Interactive Dashboards

While static plots are great for notebooks, interactive dashboards accelerate stakeholder exploration.

| Tool | Strength | Typical Use Case |
|------|----------|------------------|
| **Plotly Dash** | Full‑stack Python, easy to embed in web apps | Real‑time sales monitoring |
| **Streamlit** | Rapid prototyping, minimal boilerplate | Sharing quick insights with business teams |
| **Tableau / Power BI** | Drag‑and‑drop, powerful visual analytics | Enterprise‑wide reporting |

### Example: Streamlit Dashboard

```python
# app.py
import streamlit as st
import pandas as pd
import plotly.express as px

df = pd.read_csv('sales.csv')

st.title('Sales Dashboard')

# Filters
region = st.sidebar.multiselect('Region', df['region'].unique(), default=df['region'].unique())
filtered = df[df['region'].isin(region)]

# Plot
fig = px.bar(filtered, x='month', y='revenue', color='product', barmode='group')
st.plotly_chart(fig)
```

Run with `streamlit run app.py`.
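When reducing dimensions as in Section 4.5.1, it is worth checking how much variance the retained components actually preserve; scikit-learn exposes this as `explained_variance_ratio_`. A minimal sketch on synthetic data (a stand-in for the numeric columns of `df`, with two deliberately correlated features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: columns 0 and 1 are nearly collinear, column 2 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
features = np.hstack([
    base,
    base * 2 + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
pca.fit(scaled)

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
print('Cumulative:', pca.explained_variance_ratio_.cumsum())
```

If the cumulative ratio for your chosen components is low, a 2‑D projection may be discarding structure, and you should retain more components or fall back on t‑SNE/UMAP for visualization.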
---

## 4.7 Best Practices & Pitfalls

| Practice | Why It Matters | How to Implement |
|----------|----------------|------------------|
| **Keep a reproducible notebook** | Enables audit and collaboration | Use Jupyter or Colab with clear code cells and markdown explanations |
| **Version‑control your data snapshots** | Avoids "data drift" during analysis | Store CSVs in a dedicated data folder and tag with dates |
| **Document assumptions** | Clarifies context for stakeholders | Add a "Data Assumptions" section in the notebook |
| **Avoid data leakage** | Prevents over‑optimistic results | Never use future information when building predictive features |
| **Use consistent scales** | Easier comparison across plots | Standardize column names and units |

### Common Pitfalls

1. **Over‑plotting** – Too many points hide structure. Use transparency or hexbin plots.
2. **Misinterpreting correlation** – Correlation does not imply causation. Always corroborate with domain knowledge.
3. **Ignoring missingness** – Treat missing data as a signal, not noise.
4. **Skipping data checks** – Unchecked outliers can break models.
5. **Assuming normality** – Many datasets are skewed; use appropriate transformations.

---

## 4.8 Summary

- EDA is the *bridge* between raw data and actionable insights.
- Combine descriptive statistics, visualizations, and interactive tools for a holistic view.
- Keep analyses reproducible and well‑documented to scale.
- Use multivariate techniques to uncover hidden structure.
- Always validate findings with domain expertise.

By mastering these EDA techniques, analysts set the stage for robust modeling, trustworthy insights, and meaningful business impact.

---

> *"The first look at data often tells more than any model ever will."* – **墨羽行**