
Data Science for the Modern Analyst: From Concepts to Implementation – Chapter 2

Chapter 2: Foundations of Statistics

Published 2026-02-26 05:42

# Chapter 2 – Foundations of Statistics

In this chapter we establish the statistical backbone that underpins every data-driven decision. We progress from simple descriptive metrics that summarise a dataset, through probability fundamentals that quantify uncertainty, to hypothesis testing and inference that let you draw conclusions about populations from samples.

---

## 2.1 Descriptive Metrics

| Metric | Symbol | Formula | Typical Use Case |
|--------|--------|---------|------------------|
| Mean | \(\bar{x}\) | \(\frac{1}{n}\sum_{i=1}^{n} x_i\) | Central tendency for symmetric data |
| Median | \(\tilde{x}\) | Middle value when data are ordered | Robust central tendency |
| Mode | – | Most frequently occurring value | Categorical or multi-modal data |
| Range | \(R\) | \(\max(x) - \min(x)\) | Spread of the outermost values |
| Variance | \(s^2\) | \(\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\) | Dispersion of data |
| Standard Deviation | \(s\) | \(\sqrt{s^2}\) | Intuitive measure of spread |
| Interquartile Range (IQR) | – | \(Q_3 - Q_1\) | Robust spread measure |

### Practical Insight

- **Always plot a histogram or box-plot** alongside numeric summaries. Visuals expose skewness, kurtosis, and outliers that raw numbers can hide.
- **Choose the right metric for the distribution**: use the median when the data contain extreme values, the mean for normal-like data.

### Code Example

```python
import pandas as pd
import matplotlib.pyplot as plt

# Simulated sales data (in dollars)
sales = pd.Series([150, 130, 110, 140, 160, 155, 145])

print('Mean:', sales.mean())
print('Median:', sales.median())
print('Std Dev:', sales.std(ddof=1))

# Visualise the distribution
sales.hist(bins=5)
plt.title('Sales Distribution')
plt.xlabel('Sales ($)')
plt.ylabel('Frequency')
plt.show()
```

---

## 2.2 Probability Basics

### 2.2.1 Sample Space & Events

- **Sample Space (\(\Omega\))**: the set of all possible outcomes of an experiment.
- **Event (\(E\))**: any subset of \(\Omega\).
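These definitions can be made concrete with a few lines of enumeration. The snippet below is a minimal sketch using a fair six-sided die: it builds the sample space as a set, defines an event as a subset, and computes the event's probability by counting equally likely outcomes.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
omega = {1, 2, 3, 4, 5, 6}

# Event E: "roll an even number" (a subset of omega)
even = {x for x in omega if x % 2 == 0}

# With equally likely outcomes, P(E) = |E| / |Omega|
p_even = Fraction(len(even), len(omega))
print(p_even)  # 1/2
```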
For example, "rolling an even number" on a six-sided die is an event.

### 2.2.2 Axioms of Probability

1. \(P(E) \ge 0\) for any event \(E\).
2. \(P(\Omega) = 1\).
3. For mutually exclusive events \(E_1, E_2, \dots, E_k\): \(P\Bigl(\bigcup_{i=1}^k E_i\Bigr) = \sum_{i=1}^k P(E_i)\).

### 2.2.3 Conditional Probability

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0\]

- *Example*: if \(A\) = "sales > 150" and \(B\) = "weekend sales", then \(P(A \mid B)\) is the probability that sales exceed 150 given that it is a weekend.

### 2.2.4 Independence

- Events \(A\) and \(B\) are independent if \(P(A \cap B) = P(A)P(B)\).
- *Practical check*: if marketing spend and customer churn are independent, their joint distribution factorises into the product of the marginals.

### 2.2.5 Random Variables

- **Discrete**: countable outcomes (e.g., number of purchases).
- **Continuous**: uncountable outcomes (e.g., revenue in dollars).
- **Probability Mass Function (PMF)** for discrete variables: \(P(X = x)\).
- **Probability Density Function (PDF)** for continuous variables: \(f_X(x)\) such that \(P(a < X < b) = \int_a^b f_X(x)\,dx\).

---

## 2.3 Hypothesis Testing

| Concept | Description |
|---------|-------------|
| Null Hypothesis (\(H_0\)) | Baseline assumption (e.g., no effect). |
| Alternative Hypothesis (\(H_1\)) | Contradicts \(H_0\) (e.g., an effect exists). |
| Significance Level (\(\alpha\)) | Threshold for rejecting \(H_0\) (commonly 0.05). |
| Test Statistic | Function of the sample data used to decide whether to reject \(H_0\). |
| p-value | Probability of observing data as extreme as, or more extreme than, the sample if \(H_0\) is true. |
| Type I Error | Rejecting \(H_0\) when it is true (false positive). |
| Type II Error | Failing to reject \(H_0\) when \(H_1\) is true (false negative). |

### 2.3.1 Common Tests

| Test | When to Use | Key Assumptions |
|------|-------------|-----------------|
| One-sample t-test | Mean of a sample vs. a known population mean | Normality, independent observations |
| Two-sample t-test | Means of two independent groups | Normality, equal variances (or Welch's adjustment) |
| Paired t-test | Same subjects measured twice | Normality of the differences |
| Chi-square goodness-of-fit | Observed counts vs. expected counts | Expected counts ≥ 5 |
| Chi-square test of independence | Contingency-table analysis | Expected counts ≥ 5 |
| ANOVA | Means across more than two groups | Normality, homoscedasticity |

### 2.3.2 Worked Example: One-sample t-test

```python
import numpy as np
from scipy import stats

# Sample revenue for a new product launch
revenue = np.array([150, 130, 110, 140, 160, 155, 145])

# Population mean revenue (industry benchmark)
mu0 = 140

# Two-sided one-sample t-test
t_stat, p_val = stats.ttest_1samp(revenue, mu0)
print('t-statistic:', t_stat)
print('p-value:', p_val)
```

- *Decision*: if \(p < \alpha\) (e.g., 0.05), conclude that the new product's revenue differs significantly from the benchmark.

---

## 2.4 Statistical Inference

### 2.4.1 Sampling Distribution

- The distribution of a statistic (e.g., the mean) across many random samples from the same population.
- **Central Limit Theorem (CLT)**: for large \(n\), the sampling distribution of \(\bar{x}\) approaches normality, regardless of the population distribution.

### 2.4.2 Point Estimation

- **Estimator**: a rule that maps data to a numerical value (e.g., the sample mean as an estimator of the population mean).
- **Desirable properties**: unbiasedness, efficiency, consistency.

### 2.4.3 Confidence Intervals

- **Formula (normal approximation)**: \(\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}\).
- **Interpretation**: we are \(100(1-\alpha)\%\) confident that the interval contains the true parameter.

### 2.4.4 Margin of Error

\[ME = z_{\alpha/2}\frac{s}{\sqrt{n}}\]

- A concise way to communicate uncertainty.
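The margin-of-error formula translates directly into code. The snippet below is a minimal sketch, reusing the illustrative sales figures from earlier in the chapter, that computes \(ME = z_{\alpha/2}\, s/\sqrt{n}\) with the normal critical value from `scipy.stats`.

```python
import numpy as np
from scipy import stats

sales = np.array([150, 130, 110, 140, 160, 155, 145])

n = len(sales)
s = sales.std(ddof=1)               # sample standard deviation
alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)   # ≈ 1.96 for a 95% level

me = z * s / np.sqrt(n)
print(f'Margin of error: ±{me:.2f}')
```

For a sample this small, the t critical value used in Section 2.4.5 is the safer choice; the normal version above is the textbook formula and a good approximation once \(n\) is large.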
### 2.4.5 Practical Example: Confidence Interval for Mean Revenue

```python
import numpy as np
from scipy import stats

# Sample data
sales = np.array([150, 130, 110, 140, 160, 155, 145])

# Sample statistics
n = len(sales)
mean_sales = sales.mean()
std_sales = sales.std(ddof=1)

# 95% CI using the t-distribution
alpha = 0.05
crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t critical value for n-1 df
margin = crit * std_sales / np.sqrt(n)

ci_lower = mean_sales - margin
ci_upper = mean_sales + margin
print(f'95% CI for mean revenue: [{ci_lower:.2f}, {ci_upper:.2f}]')
```

---

## 2.5 Take-Away Checklist

- **Always summarise the data first**: compute the mean, median, SD, and IQR, and plot the distribution.
- **Understand the distribution**: check skewness, outliers, and variance homogeneity before selecting a test.
- **Match the hypothesis test to the data**: paired vs. independent samples, categorical vs. continuous variables.
- **Report both p-values and effect sizes**: significance alone can be misleading.
- **Construct confidence intervals**: they provide a range of plausible values and a sense of precision.

By mastering these fundamentals, you build a robust foundation that will support every subsequent chapter, from exploratory data analysis to deploying machine-learning models in production. The next chapter shows how to translate raw data into actionable visual narratives.