Data Intelligence: From Foundations to Applications - Chapter 2
Published 2026-02-27 18:03
# Chapter 2: Foundations of Data Analytics
In this chapter we lay the groundwork for all subsequent analysis. Understanding the statistical concepts that underpin inference, along with a disciplined approach to data types, quality, and cleaning, ensures that every model built on top of this foundation is reliable, reproducible, and insightful.
---
## 1. Statistical Fundamentals
| Concept | Definition | Practical Relevance |
|---------|------------|---------------------|
| **Distribution** | The probability law that describes how values of a random variable are spread. | Guides the choice of models and tests; e.g., normality assumption in linear regression. |
| **Hypothesis Testing** | Formal framework to decide if evidence supports a claim about a population. | Helps validate business decisions (A/B tests, policy changes). |
| **Confidence Interval** | Range of values that likely contains the true parameter with a specified probability. | Quantifies uncertainty around point estimates, crucial for risk management. |
### 1.1 Probability Distributions
| Distribution | Typical Use‑Cases | Key Parameters |
|--------------|------------------|----------------|
| Normal | Continuous, symmetric data (e.g., heights, test scores) | μ (mean), σ² (variance) |
| Binomial | Number of successes in fixed trials (e.g., click‑throughs) | n (trials), p (success probability) |
| Poisson | Count of events in a fixed interval (e.g., call arrivals) | λ (rate) |
| Exponential | Time between events (e.g., failure times) | λ (rate) |
**Practical tip:** Visualize the empirical distribution using histograms or density plots before fitting a parametric model. In Python, `seaborn.kdeplot` or `scipy.stats` provide quick diagnostics.
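As a minimal sketch of such a diagnostic (assuming `numpy` and `scipy` are available), one can fit a normal distribution to simulated data and check the fit with a Kolmogorov–Smirnov test; the simulated dataset here is purely illustrative:

```python
import numpy as np
from scipy import stats

# Simulate data that is approximately normal (illustrative only)
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=1000)

# Fit a normal distribution: returns estimated (mean, std)
mu_hat, sigma_hat = stats.norm.fit(data)
print(f"estimated mu={mu_hat:.1f}, sigma={sigma_hat:.1f}")

# Quick goodness-of-fit check: a large p-value means no evidence against normality
ks_stat, ks_p = stats.kstest(data, "norm", args=(mu_hat, sigma_hat))
print(f"KS p-value={ks_p:.3f}")
```

The same pattern applies to the other distributions in the table (e.g., `stats.expon.fit` for exponential data), making it a reusable sanity check before committing to a parametric model.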
### 1.2 Hypothesis Testing Basics
1. **State null (H₀) and alternative (H₁) hypotheses**.
2. **Choose significance level** (α, common values 0.05 or 0.01).
3. **Select test statistic** (t, z, χ², etc.).
4. **Compute p‑value** and compare to α.
5. **Make a decision**: reject or fail to reject H₀.
**Example (Python):**
```python
import scipy.stats as stats
# One‑sample t‑test: does mean of sample differ from 50?
sample = [52, 47, 53, 49, 51, 54, 48]
stat, p = stats.ttest_1samp(sample, 50)
print(f"t-statistic={stat:.3f}, p-value={p:.3f}")
```
### 1.3 Confidence Intervals
A 95% CI for a population mean μ, using the large-sample normal approximation, is:
\[
\bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}
\]
where \(\bar{x}\) is the sample mean, \(s\) the sample standard deviation, and \(n\) the sample size. For small samples, replace 1.96 with the appropriate quantile of the t-distribution.
**Practical tip:** Use `statsmodels` or `scipy.stats` to compute CIs for various parameters automatically.
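A sketch of this computation with `scipy.stats`, reusing the small sample from the t-test example above (where the t-quantile, rather than 1.96, is the appropriate multiplier):

```python
import math
import statistics
from scipy import stats

sample = [52, 47, 53, 49, 51, 54, 48]
n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation
sem = s / math.sqrt(n)         # standard error of the mean

# t-based 95% CI with n-1 degrees of freedom
lo, hi = stats.t.interval(0.95, df=n - 1, loc=xbar, scale=sem)
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```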
---
## 2. Data Types, Quality, and Cleaning Basics
| Data Type | Typical Representation | Common Challenges |
|-----------|------------------------|-------------------|
| **Numerical** (int, float) | Continuous or discrete numbers | Missing values, outliers, scaling |
| **Categorical** (string, category) | Labels or tokens | High cardinality, inconsistent labels |
| **Datetime** | Timestamps, dates | Time zone mismatches, irregular intervals |
| **Text** | Free‑form strings | Noise, varying encoding |
### 2.1 Data Quality Dimensions
| Dimension | Description | Typical Remedies |
|-----------|-------------|------------------|
| **Accuracy** | Closeness to true value | Validation against reference, manual checks |
| **Completeness** | Proportion of missing entries | Imputation, data augmentation |
| **Consistency** | Harmonized schema and formats | Standardization, canonicalization |
| **Uniqueness** | Absence of duplicates | Deduplication, primary key enforcement |
| **Timeliness** | Currency of data | Regular refresh, version control |
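Two of these dimensions, completeness and uniqueness, are easy to measure directly in `pandas`; the tiny customer table below is hypothetical, for illustration only:

```python
import pandas as pd

# Hypothetical customer table with one duplicate row and one missing value
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Taipei", "Kaohsiung", "Kaohsiung", None],
})

# Completeness: fraction of missing entries per column
print(df.isnull().mean())

# Uniqueness: count and drop exact duplicate rows
print("duplicates:", df.duplicated().sum())
deduped = df.drop_duplicates()
print("rows after dedup:", len(deduped))
```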
### 2.2 Cleaning Workflow
1. **Load & Inspect**: `df.head()`, `df.info()`, `df.describe()`.
2. **Detect Missingness**: `df.isnull().sum()`.
3. **Handle Missing Data**:
* Drop: `df.dropna()`.
* Impute numeric: `df.fillna(df.mean())`.
* Impute categorical: mode or a placeholder.
4. **Detect Outliers**: boxplots, IQR, Z‑score.
5. **Transform & Encode**:
* Standardization: `StandardScaler`.
* Normalization: `MinMaxScaler`.
* One‑hot / target encoding for categorical.
6. **Validate**: re‑run `df.describe()` to confirm.
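Steps 2, 3, and 6 of the workflow above can be sketched as follows; the dataset and column names are hypothetical:

```python
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "age": [25, None, 40, 35, None],
    "segment": ["A", "B", None, "B", "B"],
})

# Step 2: quantify missingness per column
print(df.isnull().sum())

# Step 3: impute numeric with the mean, categorical with the mode
df["age"] = df["age"].fillna(df["age"].mean())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])

# Step 6: validate that no missing values remain
assert df.isnull().sum().sum() == 0
```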
**Code Snippet – Outlier Removal using IQR**
```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep only rows whose `column` value lies within k * IQR of the quartiles."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - k * IQR
    upper_bound = Q3 + k * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Example usage
clean_df = remove_outliers_iqr(df, "age")
```
### 2.3 Data Profiling Tools
| Tool | Language | Strength |
|------|----------|----------|
| `ydata-profiling` (formerly `pandas_profiling`) | Python | Generates interactive HTML reports in seconds |
| `sweetviz` | Python | Visual dashboards for comparison |
| `DataCleaner` | R | Comprehensive cleaning and imputation pipelines |
| `Dataiku` | Enterprise | Drag‑and‑drop with automated profiling |
**Practical tip:** Run a quick profiling report at the start of a project to surface hidden data quality issues before investing in complex modeling.
---
## 3. Putting It Together – A Mini‑Project
| Step | Task | Tools |
|------|------|-------|
| 1 | Define business question | Jupyter Notebook |
| 2 | Load data | `pandas.read_csv` |
| 3 | Explore & profile | `pandas_profiling`, `seaborn` |
| 4 | Clean & transform | `pandas`, `scikit-learn` scalers |
| 5 | Statistical tests | `scipy.stats` |
| 6 | Report results | Markdown, Matplotlib |
**Deliverable:** A notebook that walks the reader from raw CSV to a confidence interval for a key metric, ready for stakeholder presentation.
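A compressed sketch of that pipeline, using only the standard library; the metric name and raw values are hypothetical, and the normal-approximation CI from §1.3 is used for brevity:

```python
import math
import statistics

# Hypothetical raw measurements of a key metric (e.g., daily conversion rate, %)
raw = [3.1, 2.8, None, 3.4, 3.0, 2.9, 250.0, 3.2]  # one missing, one entry error

# Clean: drop the missing value and an obviously impossible entry
values = [v for v in raw if v is not None and v < 100]

# Report: 95% CI via the normal approximation
n = len(values)
xbar = statistics.mean(values)
s = statistics.stdev(values)
half = 1.96 * s / math.sqrt(n)
print(f"mean={xbar:.2f}, 95% CI=({xbar - half:.2f}, {xbar + half:.2f})")
```

In the real deliverable, each of these steps would be its own notebook cell with profiling output and a short narrative for stakeholders.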
---
## 4. Take‑Away Checklist
- [ ] Verify data types before modeling.
- [ ] Run a profiling report to uncover hidden issues.
- [ ] Document all cleaning steps in version‑controlled scripts.
- [ ] Use statistical tests to support decisions.
- [ ] Communicate uncertainty via confidence intervals.
By mastering these fundamentals, you create a resilient pipeline that transforms messy data into trustworthy insights. In the next chapter, we’ll learn how to visualize those insights to uncover deeper patterns and drive action.