Data Science Unveiled: A Structured Blueprint for Analysts - Chapter 3
Chapter 3: From Raw to Ready – Data Acquisition & Cleansing
Published 2026-03-03 21:37
# Chapter 3
## From Raw to Ready – Data Acquisition & Cleansing
Data is the currency of analytics, but raw data rarely arrives in a clean, usable form. This chapter shifts the focus from *what* we want to measure to *how* we acquire data and prepare it for the later stages of the data science lifecycle. We emphasize rigorous procedures, reproducibility, and the principle that a solid foundation reduces downstream errors.
---
### 1. The Acquisition Landscape
| Source | Typical Format | Common Pitfalls | Mitigation Strategy |
|--------|----------------|-----------------|---------------------|
| APIs | JSON, XML, CSV | Rate limits, pagination | Token rotation, retry logic |
| Databases | SQL tables | Schema drift, nullability | Schema registry, versioned snapshots |
| Files | CSV, Parquet, Excel | Mixed delimiters, hidden BOM | Standardized ingest pipeline |
| Web Scraping | HTML, JSON | CAPTCHAs, anti‑scraping measures | User‑agent rotation, headless browsers |
*Key Takeaway:* Always **document** the source, access method, and any transformation that occurs during ingestion.
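The retry logic suggested for rate-limited APIs can be sketched as a small wrapper with jittered exponential backoff. `with_retries` and its parameters are illustrative helpers, not part of any specific client library:

```python
import random
import time

def with_retries(fetch, max_attempts=5, base_delay=1.0):
    """Call `fetch`, retrying on any exception with jittered
    exponential backoff; re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep grows 1x, 2x, 4x, ... with random jitter to avoid
            # synchronized retries against the same endpoint.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
```

In practice you would catch only retryable errors (HTTP 429/5xx) rather than bare `Exception`, and log each failed attempt for the audit trail.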
---
### 2. Building an Ingestion Pipeline
1. **Metadata Registry** – Capture source, schema, and version.
2. **Connection Layer** – Abstract the provider via a driver or wrapper.
3. **Batch vs. Stream** – Decide on latency and volume constraints.
4. **Error Handling** – Log failures with context for audit.
5. **Storage** – Prefer immutable, append‑only blobs for raw data; use a data lake for raw, a data warehouse for cleaned.
```python
import pandas as pd

# Example: read a CSV directly from S3. Passing the URL as a plain
# string lets pandas delegate to s3fs (which must be installed);
# wrapping it in pathlib.Path would mangle the "s3://" scheme.
raw_path = "s3://my-bucket/raw-data/2024-03-01_sales.csv"
raw_df = pd.read_csv(raw_path)
```
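Step 1 of the pipeline, the metadata registry, can start as an append-only log of provenance records written at ingest time. The field names below are a minimal sketch, not a fixed schema:

```python
import datetime
import json

def ingestion_record(source: str, schema_version: str, row_count: int) -> str:
    """Serialize one provenance entry; append the result to an
    append-only log (e.g. a JSON-lines file) alongside the raw blob."""
    return json.dumps({
        "source": source,
        "schema_version": schema_version,
        "row_count": row_count,
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
```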
---
### 3. Data Quality Dimensions
| Dimension | Definition | Validation Technique |
|-----------|------------|----------------------|
| **Completeness** | Absence of missing values | `df.isnull().mean()` |
| **Consistency** | Uniform formats across columns | Regex checks, dtype enforcement |
| **Accuracy** | Alignment with external reference | Cross‑validation, sanity‑checks |
| **Uniqueness** | No duplicate records | `df.duplicated().sum()` |
| **Timeliness** | Data is recent | Timestamp checks, lag analysis |
*Remember:* Quality is a *measure*, not a *goal*. Quantify it, then set thresholds.
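Two of these dimensions can be quantified directly with the one-liners from the table. A sketch, assuming a pandas DataFrame; the returned fractions are what you would compare against thresholds:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Quantify completeness and uniqueness as fractions in [0, 1]."""
    return {
        # Fraction of non-null cells across the whole frame.
        "completeness": 1.0 - df.isnull().mean().mean(),
        # Fraction of rows that are not exact duplicates of an earlier row.
        "uniqueness": 1.0 - df.duplicated().mean(),
    }
```

Consistency, accuracy, and timeliness need column-specific checks (regexes, reference data, timestamp lag) and do not reduce to a single expression as cleanly.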
---
### 4. Cleaning Strategies
| Task | Typical Operations | Tooling |
|------|---------------------|---------|
| **Missing Value Imputation** | Mean/median, KNN, regression | `sklearn.impute` |
| **Outlier Detection** | IQR, z‑score, DBSCAN | `scipy.stats`, `sklearn.cluster` |
| **Standardization** | Min‑Max, Z‑score | `sklearn.preprocessing` |
| **Deduplication** | Exact match, fuzzy | `dedupe`, `thefuzz` (formerly `fuzzywuzzy`) |
| **Transformation** | Log, Box‑Cox | `scipy.stats.boxcox` |
```python
# Quick median imputation: compute per-column medians over numeric
# columns only (mixed-type frames would otherwise raise), then fill.
median_vals = raw_df.median(numeric_only=True)
clean_df = raw_df.fillna(median_vals)
```
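The IQR rule from the outlier-detection row can be written as a small mask function. This is one sketch of the technique; the multiplier `k = 1.5` is the conventional default, not a universal constant:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value falls outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)
```

Whether flagged values are dropped, winsorized, or investigated is a domain decision; the mask only identifies candidates.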
---
### 5. Reproducible Cleaning Pipelines
1. **Notebook as Documentation** – Every transformation should be recorded.
2. **Scripted Pipelines** – Encapsulate logic in functions or classes.
3. **Version Control** – Store data, code, and configuration in a single repo.
4. **Data Provenance** – Keep lineage links: raw → cleaned → model.
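The "scripted pipelines" point above can be realized as a list of small, individually testable functions applied in order. The step names and the column used here are hypothetical:

```python
import pandas as pd

def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates()

def impute_numeric_medians(df: pd.DataFrame) -> pd.DataFrame:
    return df.fillna(df.median(numeric_only=True))

# Order matters: dedupe first so imputed medians are not skewed
# by repeated rows.
CLEANING_STEPS = [drop_exact_duplicates, impute_numeric_medians]

def run_pipeline(df: pd.DataFrame, steps=CLEANING_STEPS) -> pd.DataFrame:
    """Apply each cleaning step in sequence; every step is a pure
    DataFrame -> DataFrame function, so it can be unit-tested alone."""
    for step in steps:
        df = step(df)
    return df
```

Because each step is a named function, the pipeline doubles as documentation and diffs cleanly under version control.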
---
### 6. Common Pitfalls & How to Avoid Them
| Pitfall | Why it Matters | Prevention |
|----------|----------------|------------|
| **Over‑fitting to Cleaned Data** | Cleaned data may hide real variance | Use hold‑out data, sanity‑check distributions |
| **Erosion of Raw Data** | Loss of source for audit | Immutable raw storage, checksum verification |
| **Ignoring Domain Semantics** | Numeric encoding misinterprets categories | Leverage categorical encoding with domain knowledge |
| **Neglecting Temporal Drift** | Models become stale | Periodic re‑acquisition, concept‑drift detection |
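The checksum verification suggested against raw-data erosion can be a plain SHA-256 digest recorded at ingest and re-checked on every later read; a minimal sketch over in-memory bytes:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Digest raw bytes; store the hex string next to the blob and
    compare on each read to detect silent modification."""
    return hashlib.sha256(data).hexdigest()
```

For multi-gigabyte files, feed the hash object in chunks (`h.update(chunk)`) rather than loading the whole blob into memory.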
---
### 7. Quick‑Start Checklist
- [ ] Source metadata captured
- [ ] Ingestion pipeline in place
- [ ] Data quality metrics logged
- [ ] Cleaning scripts versioned
- [ ] Data lineage tracked
---
### 8. A Real‑World Example
**Scenario:** A retail chain wants to predict next‑quarter sales. Raw data comes from multiple legacy systems (POS, ERP, CRM) in disparate formats. By implementing the steps above, we consolidated 10 TB of data, reduced missingness from 18% to 1%, and achieved a 12% lift in model accuracy after cleaning.
---
> *Pro Tip:* Treat every cleaning rule as a hypothesis. Test its effect on downstream metrics before committing it to the pipeline.
---
### 9. Take‑Away
Data acquisition and cleaning are *not* a one‑time chore but a continuous discipline that safeguards the integrity of every analysis. A well‑structured ingestion pipeline, combined with rigorous quality checks, turns chaotic raw data into a reliable, reproducible foundation for insight.
> *Remember:* The effort spent in cleaning often pays dividends that outweigh the complexity added later in the modeling stage.