
Data Science for Business Insight: A Practical Guide for Decision‑Makers – Chapter 2


Published 2026-02-27 12:16

# Chapter 2: Data Collection & Cleaning

> **Data is the foundation of insight.** In this chapter we explore how to acquire high‑quality data, assess its integrity, and cleanse it so that downstream analysis is trustworthy and reproducible.

---

## 2.1 Why Data Collection Matters

* **Decision accuracy** – Poorly collected data can bias outcomes.
* **Reproducibility** – Transparent acquisition processes enable audit trails.
* **Regulatory compliance** – Data lineage is often required for GDPR, CCPA, and industry‑specific standards.

> *Key takeaway*: Treat data acquisition as the first layer of the data‑science pipeline, just as important as modeling or deployment.

---

## 2.2 Identifying Data Sources

| Source Type | Typical Examples | Typical Volume | Common Challenges |
|-------------|------------------|----------------|-------------------|
| Internal | ERP, CRM, BI logs | Medium–Large | Data silos, inconsistent schemas |
| External | APIs (Twitter, OpenWeather), public datasets (Kaggle, Census), web scraping | Small–Large | Rate limits, legal restrictions |
| Sensor | IoT devices, loggers | Very large | High‑frequency noise, time‑sync issues |
| Survey | Customer satisfaction, employee engagement | Medium | Low response rates, sampling bias |

### 2.2.1 Mapping Business Questions to Data Needs

1. **Define the objective** – e.g., *increase customer retention*.
2. **Identify key metrics** – churn rate, lifetime value.
3. **List data requirements** – demographic tables, transaction logs, support tickets.
4. **Validate availability** – confirm data exists, access rights, refresh cadence.
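As an illustration of step 2, a metric such as churn rate can be derived directly from a transaction log. This is a minimal sketch: the column names (`customer_id`, `purchase_date`), the reference date, and the 90‑day inactivity window are assumptions chosen for the example, not fixed definitions.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "purchase_date": pd.to_datetime(
        ["2025-01-05", "2025-06-20", "2025-02-14", "2024-11-30"]
    ),
})

as_of = pd.Timestamp("2025-07-01")
churn_window = pd.Timedelta(days=90)

# A customer counts as churned if their most recent purchase falls
# outside the 90-day window before the reference date.
last_purchase = transactions.groupby("customer_id")["purchase_date"].max()
churned = last_purchase < (as_of - churn_window)
churn_rate = churned.mean()

print(f"Churn rate: {churn_rate:.1%}")  # → Churn rate: 66.7%
```

Pinning the window and reference date up front keeps the metric reproducible: rerunning the script on the same log always yields the same figure.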
---

## 2.3 Data Acquisition Techniques

| Technique | Use Case | Pros | Cons |
|-----------|----------|------|------|
| SQL queries | Structured data extraction | Precise, reproducible | Requires schema knowledge |
| API clients | Real‑time or bulk data | Automated, scalable | Rate limits, authentication overhead |
| Web scraping | Unstructured public data | Flexible | Legal risks, HTML changes |
| ETL pipelines | Enterprise data integration | Centralized, traceable | Complex tooling, latency |
| Direct uploads | Ad hoc data from stakeholders | Simple | Potential for duplicates, incomplete metadata |

### 2.3.1 Example: Pulling Customer Data from a REST API

```python
import requests
import pandas as pd

API_ENDPOINT = "https://api.example.com/v1/customers"
API_KEY = "YOUR_API_KEY"

headers = {"Authorization": f"Bearer {API_KEY}"}

def fetch_page(page):
    # Time out so a stalled connection cannot hang the pipeline.
    response = requests.get(
        API_ENDPOINT, params={"page": page}, headers=headers, timeout=30
    )
    response.raise_for_status()
    return response.json()

# Pull all pages until the API reports no further page.
customers = []
page = 1
while True:
    data = fetch_page(page)
    customers.extend(data["results"])
    if not data["next"]:
        break
    page += 1

df = pd.DataFrame(customers)
print(df.head())
```

---

## 2.4 Assessing Data Quality

### 2.4.1 Core Quality Dimensions

| Dimension | Definition |
|-----------|------------|
| **Accuracy** | Closeness to the true value |
| **Completeness** | Proportion of missing or null entries |
| **Consistency** | Agreement across datasets or fields |
| **Timeliness** | Currency of the data relative to the analysis horizon |
| **Validity** | Conformance to business rules or data types |
| **Uniqueness** | Absence of duplicates |

### 2.4.2 Quality Check Checklist

1. Are column data types aligned with expectations?
2. Do any fields violate business rules (e.g., negative ages)?
3. Are there missing values beyond acceptable thresholds?
4. Are date/time stamps in a consistent timezone?
5. Do foreign keys match reference tables?
6. Are there duplicate rows or keys?
7. Does the data distribution match historical patterns?

---

## 2.5 Data Cleaning Techniques

| Problem | Typical Remedy | Example Code |
|---------|----------------|--------------|
| Missing values | Imputation (mean, median, mode, KNN) | `df['age'] = df['age'].fillna(df['age'].median())` |
| Outliers | Winsorization, trimming | `df['sales'] = df['sales'].clip(lower=0, upper=df['sales'].quantile(0.99))` |
| Inconsistent formats | Normalization (e.g., phone numbers) | `df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)` |
| Duplicate records | Deduplication by key | `df = df.drop_duplicates(subset=['customer_id'], keep='first')` |
| Categorical encoding | One‑hot, target, ordinal | `pd.get_dummies(df, columns=['color'])` |
| Data type conversion | `pd.to_datetime`, `astype` | `df['date'] = pd.to_datetime(df['date'])` |

### 2.5.1 Advanced Cleaning: Handling Mixed‑Type Columns

```python
# Example: a column containing both numeric and text ratings
import numpy as np

def clean_rating(val):
    try:
        return float(val)
    except (ValueError, TypeError):  # text ratings and None both map to NaN
        return np.nan

df['rating_clean'] = df['rating_raw'].apply(clean_rating)
```

---

## 2.6 Validation & Testing

| Validation Type | Tool | Purpose |
|-----------------|------|---------|
| Schema validation | Great Expectations | Ensure columns, data types, constraints |
| Unit tests | PyTest, unittest | Verify cleaning functions on test cases |
| Integration tests | Airflow DAG tests | Confirm pipeline flows correctly |
| Data drift monitoring | Evidently, Deequ | Detect distribution shifts |

### 2.6.1 Sample Great Expectations Suite

```yaml
data_asset_name: customers
dataset: customers.csv
expectation_suite_name: customers_validation
expectations:
  - expectation_type: expect_column_values_to_not_be_null
    kwargs:
      column: customer_id
  - expectation_type: expect_column_values_to_be_of_type
    kwargs:
      column: age
      type_: int
  - expectation_type: expect_column_values_to_be_in_set
    kwargs:
      column: country
      value_set:
        - US
        - UK
        - CA
```

---
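The unit-test row in the table above can be sketched with PyTest. The function under test is `clean_rating` from §2.5.1, reproduced here so the file is self-contained; the test names are illustrative, not part of any existing suite.

```python
import math

import numpy as np

def clean_rating(val):
    """Coerce a mixed-type rating to float; unparseable values become NaN."""
    try:
        return float(val)
    except (ValueError, TypeError):
        return np.nan

# PyTest discovers and runs any function whose name starts with test_.
def test_numeric_string_is_parsed():
    assert clean_rating("4.5") == 4.5

def test_text_rating_becomes_nan():
    assert math.isnan(clean_rating("excellent"))

def test_none_becomes_nan():
    assert math.isnan(clean_rating(None))
```

Saved as, say, `test_cleaning.py`, this runs with a bare `pytest` command; each edge case that once broke the pipeline can be pinned down as another small test.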
## 2.7 Documentation & Data Lineage

| Document Component | Description |
|--------------------|-------------|
| Data Dictionary | Column names, types, meanings |
| Acquisition Logs | Source, timestamps, frequency |
| Cleaning Scripts | Versioned code, rationale |
| Quality Reports | Summary of defects, corrections |
| Lineage Diagram | Data flow from source to analytical model |

> **Tip**: Use a **Data Catalog** (e.g., Alation, Collibra) to centralize metadata and keep stakeholders informed.

---

## 2.8 Ethical & Legal Considerations

1. **Consent & Purpose Limitation** – Ensure data was collected with explicit consent for the intended use.
2. **Data Minimization** – Collect only what is necessary.
3. **Anonymization & Pseudonymization** – Remove or mask personally identifiable information (PII).
4. **Compliance** – Adhere to GDPR, CCPA, HIPAA, and PCI‑DSS where applicable.
5. **Bias Detection** – Review sampling frames for representation bias.
6. **Transparency** – Maintain clear audit trails for data lineage.

---

## 2.9 Tools & Platforms Overview

| Category | Tool | Strength | Typical Use |
|----------|------|----------|-------------|
| Data Ingestion | Apache NiFi, Talend | Drag‑and‑drop, streaming | Real‑time pipelines |
| ETL | dbt, Pentaho | SQL‑centric transformations | Data warehouse prep |
| Data Quality | Great Expectations, Deequ | Assertions, drift detection | Validation |
| Storage | Snowflake, Redshift, BigQuery | Scalable, columnar | Raw & cleaned data |
| Orchestration | Airflow, Prefect | Workflow scheduling | End‑to‑end pipelines |
| Version Control | Git, DVC | Code & data lineage | Reproducibility |

---

## 2.10 Checklist for a Clean Data Foundation

1. **Source mapping** – All sources documented and agreed upon.
2. **Acquisition scripts** – Versioned, automated, and tested.
3. **Schema validation** – Enforced at ingestion.
4. **Quality monitoring** – Automated alerts on drift or defects.
5. **Cleaning pipelines** – Modular, idempotent, and logged.
6. **Metadata catalog** – Up‑to‑date data dictionary and lineage.
7. **Security controls** – Encryption, role‑based access, audit logs.
8. **Compliance audit** – Regular reviews against regulatory frameworks.

---

## 2.11 Summary

* Data collection is a disciplined activity that defines the scope and reliability of any analysis.
* Quality assessment and cleaning transform raw inputs into trustworthy evidence.
* Automation, validation, and documentation are essential to scale these processes and maintain governance.
* Ethical stewardship safeguards both the organization and the individuals whose data fuels insight.

> *In the next chapter we will dive into exploratory data analysis, turning clean data into actionable patterns.*