Data Intelligence: From Foundations to Applications - Chapter 2
Published 2026-02-27 18:03
# Chapter 2: Foundations of Data Analytics
In this chapter we lay the groundwork for all subsequent analysis. Understanding the statistical concepts that underpin inference, along with a disciplined approach to data types, quality, and cleaning, ensures that every model built on top of this foundation is reliable, reproducible, and insightful.
---
## 1. Statistical Fundamentals
| Concept | Definition | Practical Relevance |
|---------|------------|---------------------|
| **Distribution** | The probability law that describes how values of a random variable are spread. | Guides the choice of models and tests; e.g., normality assumption in linear regression. |
| **Hypothesis Testing** | Formal framework to decide if evidence supports a claim about a population. | Helps validate business decisions (A/B tests, policy changes). |
| **Confidence Interval** | Range of values that likely contains the true parameter with a specified probability. | Quantifies uncertainty around point estimates, crucial for risk management. |
### 1.1 Probability Distributions
| Distribution | Typical Use‑Cases | Key Parameters |
|--------------|------------------|----------------|
| Normal | Continuous, symmetric data (e.g., heights, test scores) | μ (mean), σ² (variance) |
| Binomial | Number of successes in fixed trials (e.g., click‑throughs) | n (trials), p (success probability) |
| Poisson | Count of events in a fixed interval (e.g., call arrivals) | λ (rate) |
| Exponential | Time between events (e.g., failure times) | λ (rate) |
**Practical tip:** Visualize the empirical distribution using histograms or density plots before fitting a parametric model. In Python, `seaborn.kdeplot` or `scipy.stats` provide quick diagnostics.
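As a minimal sketch of such a diagnostic (assuming `numpy` and `scipy` are available), one can fit a normal distribution to simulated data and check the fit with a Kolmogorov–Smirnov test; the simulated dataset here is purely illustrative:

```python
import numpy as np
from scipy import stats

# Simulate data that is approximately normal (illustrative only)
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=1000)

# Fit a normal distribution: returns estimated (mean, std)
mu_hat, sigma_hat = stats.norm.fit(data)
print(f"estimated mu={mu_hat:.1f}, sigma={sigma_hat:.1f}")

# Quick goodness-of-fit check: a large p-value means no evidence against normality
ks_stat, ks_p = stats.kstest(data, "norm", args=(mu_hat, sigma_hat))
print(f"KS p-value={ks_p:.3f}")
```

The same pattern applies to the other distributions in the table (e.g., `stats.expon.fit` for exponential data), making it a reusable sanity check before committing to a parametric model.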
### 1.2 Hypothesis Testing Basics
1. **State null (H₀) and alternative (H₁) hypotheses**.
2. **Choose significance level** (α, common values 0.05 or 0.01).
3. **Select test statistic** (t, z, χ², etc.).
4. **Compute p‑value** and compare to α.
5. **Make a decision**: reject or fail to reject H₀.
**Example (Python):**
```python
import scipy.stats as stats
# One‑sample t‑test: does mean of sample differ from 50?
sample = [52, 47, 53, 49, 51, 54, 48]
stat, p = stats.ttest_1samp(sample, 50)
print(f"t-statistic={stat:.3f}, p-value={p:.3f}")
```
### 1.3 Confidence Intervals
A 95% CI for a population mean μ, using the large-sample normal approximation, is:
\[
\bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}
\]
where \(\bar{x}\) is the sample mean, \(s\) the sample standard deviation, and \(n\) the sample size. For small samples, replace 1.96 with the appropriate quantile of the t-distribution.
**Practical tip:** Use `statsmodels` or `scipy.stats` to compute CIs for various parameters automatically.
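A sketch of this computation with `scipy.stats`, reusing the small sample from the t-test example above (where the t-quantile, rather than 1.96, is the appropriate multiplier):

```python
import math
import statistics
from scipy import stats

sample = [52, 47, 53, 49, 51, 54, 48]
n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation
sem = s / math.sqrt(n)         # standard error of the mean

# t-based 95% CI with n-1 degrees of freedom
lo, hi = stats.t.interval(0.95, df=n - 1, loc=xbar, scale=sem)
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```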
---
## 2. Data Types, Quality, and Cleaning Basics
| Data Type | Typical Representation | Common Challenges |
|-----------|------------------------|-------------------|
| **Numerical** (int, float) | Continuous or discrete numbers | Missing values, outliers, scaling |
| **Categorical** (string, category) | Labels or tokens | High cardinality, inconsistent labels |
| **Datetime** | Timestamps, dates | Time zone mismatches, irregular intervals |
| **Text** | Free‑form strings | Noise, varying encoding |
### 2.1 Data Quality Dimensions
| Dimension | Description | Typical Remedies |
|-----------|-------------|------------------|
| **Accuracy** | Closeness to true value | Validation against reference, manual checks |
| **Completeness** | Proportion of missing entries | Imputation, data augmentation |
| **Consistency** | Harmonized schema and formats | Standardization, canonicalization |
| **Uniqueness** | Absence of duplicates | Deduplication, primary key enforcement |
| **Timeliness** | Currency of data | Regular refresh, version control |
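Two of these dimensions, completeness and uniqueness, are easy to measure directly in `pandas`; the tiny customer table below is hypothetical, for illustration only:

```python
import pandas as pd

# Hypothetical customer table with one duplicate row and one missing value
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Taipei", "Kaohsiung", "Kaohsiung", None],
})

# Completeness: fraction of missing entries per column
print(df.isnull().mean())

# Uniqueness: count and drop exact duplicate rows
print("duplicates:", df.duplicated().sum())
deduped = df.drop_duplicates()
print("rows after dedup:", len(deduped))
```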
### 2.2 Cleaning Workflow
1. **Load & Inspect**: `df.head()`, `df.info()`, `df.describe()`.
2. **Detect Missingness**: `df.isnull().sum()`.
3. **Handle Missing Data**:
* Drop: `df.dropna()`.
* Impute numeric: `df.fillna(df.mean())`.
* Impute categorical: mode or a placeholder.
4. **Detect Outliers**: boxplots, IQR, Z‑score.
5. **Transform & Encode**:
* Standardization: `StandardScaler`.
* Normalization: `MinMaxScaler`.
* One‑hot / target encoding for categorical.
6. **Validate**: re‑run `df.describe()` to confirm.
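Steps 2, 3, and 6 of the workflow above can be sketched as follows; the dataset and column names are hypothetical:

```python
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "age": [25, None, 40, 35, None],
    "segment": ["A", "B", None, "B", "B"],
})

# Step 2: quantify missingness per column
print(df.isnull().sum())

# Step 3: impute numeric with the mean, categorical with the mode
df["age"] = df["age"].fillna(df["age"].mean())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])

# Step 6: validate that no missing values remain
assert df.isnull().sum().sum() == 0
```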
**Code Snippet – Outlier Removal using IQR**
```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep only rows whose `column` value lies within k * IQR of the quartiles."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - k * IQR
    upper_bound = Q3 + k * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Example usage
clean_df = remove_outliers_iqr(df, "age")
```
### 2.3 Data Profiling Tools
| Tool | Language | Strength |
|------|----------|----------|
| `ydata-profiling` (formerly `pandas_profiling`) | Python | Generates interactive HTML reports in seconds |
| `sweetviz` | Python | Visual dashboards for comparison |
| `DataCleaner` | R | Comprehensive cleaning and imputation pipelines |
| `Dataiku` | Enterprise | Drag‑and‑drop with automated profiling |
**Practical tip:** Run a quick profiling report at the start of a project to surface hidden data quality issues before investing in complex modeling.
---
## 3. Putting It Together – A Mini‑Project
| Step | Task | Tools |
|------|------|-------|
| 1 | Define business question | Jupyter Notebook |
| 2 | Load data | `pandas.read_csv` |
| 3 | Explore & profile | `pandas_profiling`, `seaborn` |
| 4 | Clean & transform | `pandas`, `scikit-learn` scalers |
| 5 | Statistical tests | `scipy.stats` |
| 6 | Report results | Markdown, Matplotlib |
**Deliverable:** A notebook that walks the reader from raw CSV to a confidence interval for a key metric, ready for stakeholder presentation.
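A compressed sketch of that pipeline, using only the standard library; the metric name and raw values are hypothetical, and the normal-approximation CI from §1.3 is used for brevity:

```python
import math
import statistics

# Hypothetical raw measurements of a key metric (e.g., daily conversion rate, %)
raw = [3.1, 2.8, None, 3.4, 3.0, 2.9, 250.0, 3.2]  # one missing, one entry error

# Clean: drop the missing value and an obviously impossible entry
values = [v for v in raw if v is not None and v < 100]

# Report: 95% CI via the normal approximation
n = len(values)
xbar = statistics.mean(values)
s = statistics.stdev(values)
half = 1.96 * s / math.sqrt(n)
print(f"mean={xbar:.2f}, 95% CI=({xbar - half:.2f}, {xbar + half:.2f})")
```

In the real deliverable, each of these steps would be its own notebook cell with profiling output and a short narrative for stakeholders.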
---
## 4. Take‑Away Checklist
- [ ] Verify data types before modeling.
- [ ] Run a profiling report to uncover hidden issues.
- [ ] Document all cleaning steps in version‑controlled scripts.
- [ ] Use statistical tests to support decisions.
- [ ] Communicate uncertainty via confidence intervals.
By mastering these fundamentals, you create a resilient pipeline that transforms messy data into trustworthy insights. In the next chapter, we’ll learn how to visualize those insights to uncover deeper patterns and drive action.