Data Science for Strategic Decision-Making: A Practical Guide – Chapter 2
Published 2026-03-03 18:00
# Chapter 2: Data Acquisition – From Need to Pipeline
In this chapter we move from the abstract “what” of the problem to the concrete “how” of collecting the data that will fuel our models. A robust acquisition strategy is the foundation on which the entire data‑science lifecycle is built.
## 1. Map the Data Landscape
1. **Identify the Source Landscape** – List all potential data stores (CRM, ERP, IoT, third‑party APIs, web services). Create a *Data Source Inventory* table:
| Source | Type | Owner | Frequency | Access Method |
|--------|------|-------|-----------|---------------|
2. **Define Data Quality Expectations** – For each source, note
* Latency tolerance
* Accuracy/Completeness metrics
* Schema stability
3. **Set Governance Boundaries** – Who can read/write? What privacy or compliance rules apply?
> **Tip**: Keep the inventory version‑controlled (e.g., in a lightweight JSON repo). It becomes the living contract between data owners and the analytics team.
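As a sketch of that version-controlled inventory, one entry might look like the following. The source name, fields, and values are illustrative, not a prescribed schema:

```python
import json

# A hypothetical Data Source Inventory entry, stored as JSON and
# version-controlled alongside the pipeline code.
entry = json.loads("""
{
  "source": "crm_contacts",
  "type": "database",
  "owner": "sales-ops",
  "frequency": "daily",
  "access_method": "JDBC",
  "latency_tolerance_hours": 24,
  "schema_stable": true
}
""")

print(entry["source"], entry["frequency"])
```

Because the file is plain JSON under version control, a change to `owner` or `frequency` shows up in a diff and can be reviewed like any other code change.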
## 2. Formalize Data Contracts
A *Data Contract* is a lightweight API‑style specification that declares:
| Element | Description |
|---------|-------------|
| Endpoint | URL or file path |
| Schema | JSON Schema / Avro |
| Frequency | Batch, stream, on‑demand |
| SLA | Latency, MTTR |
| Validation Rules | Checksums, value ranges |
Use tools like **Apigee** or **OpenAPI** for REST contracts, or **Confluent Schema Registry** for Kafka topics.
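As a minimal, stdlib-only sketch of how the validation rules in such a contract might be enforced at ingestion time (the contract fields and limits here are illustrative, not a real Apigee or Schema Registry API):

```python
# Hypothetical data contract: required fields plus simple value-range rules.
CONTRACT = {
    "required_fields": {"order_id", "store_id", "amount"},
    "value_ranges": {"amount": (0.0, 100_000.0)},
}

def validate(record: dict, contract: dict) -> list:
    """Return a list of contract violations for one record."""
    errors = []
    missing = contract["required_fields"] - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    for field, (lo, hi) in contract["value_ranges"].items():
        if field in record and not (lo <= record[field] <= hi):
            errors.append(f"{field}={record[field]} outside [{lo}, {hi}]")
    return errors

good = {"order_id": 1, "store_id": "S1", "amount": 42.5}
bad = {"order_id": 2, "amount": -5.0}
print(validate(good, CONTRACT))  # []
print(validate(bad, CONTRACT))
```

Running violations back to the data owner, rather than silently dropping records, is what makes the contract a two-way agreement.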
## 3. Choose the Right Ingestion Architecture
| Architecture | When to Use | Typical Tools |
|--------------|-------------|---------------|
| **Batch** | Periodic reporting, nightly jobs | Airflow, Prefect, dbt |
| **Streaming** | Real‑time analytics, fraud detection | Kafka, Kinesis, Flink |
| **Hybrid** | Mixed workloads | Lambda + S3 + Glue |
Design the pipeline with **de‑duplication** and **idempotency** in mind. A single source can push duplicate events; your pipeline must handle that gracefully.
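The de-duplication requirement above can be sketched with a simple event-ID check; the event shape and IDs are illustrative:

```python
def deduplicate(events, seen_ids=None):
    """Drop events whose 'event_id' was already processed, so that
    replaying the same batch is idempotent."""
    seen_ids = set() if seen_ids is None else seen_ids
    unique = []
    for event in events:
        if event["event_id"] not in seen_ids:
            seen_ids.add(event["event_id"])
            unique.append(event)
    return unique

batch = [
    {"event_id": "a1", "amount": 10},
    {"event_id": "a1", "amount": 10},  # duplicate push from the source
    {"event_id": "b2", "amount": 7},
]
print(len(deduplicate(batch)))  # 2
```

In production the `seen_ids` set would live in durable storage (or the sink would enforce a unique key), but the contract is the same: processing the same input twice must not produce duplicate output.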
## 4. Build the Data Pipeline
1. **Extract** – Use connectors (e.g., JDBC for SQL, S3 for files, SDKs for APIs). Prefer *pull* mechanisms when the data provider is stable; use *push* for high‑velocity streams.
2. **Transform** – Apply ETL or ELT steps:
* **Schema Alignment** – Map legacy fields to new names.
* **Data Cleansing** – Null handling, outlier filtering.
* **Feature Engineering** – Compute rolling aggregates, join historical snapshots.
3. **Load** – Store in a *Data Lake* (S3, ADLS) or a *Data Warehouse* (Snowflake, BigQuery). Use partitioning by date or business key to accelerate downstream queries.
```python
# Sample Airflow DAG: copy the raw daily sales file into the processed bucket.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3CopyObjectOperator

with DAG(
    'daily_sales_ingest',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    copy_task = S3CopyObjectOperator(
        task_id='copy_raw_sales',
        source_bucket_name='raw-sales',
        dest_bucket_name='processed-sales',
        source_bucket_key='{{ ds }}/sales.csv',
        # An S3 copy moves bytes unchanged; converting CSV to Parquet
        # would be a separate transform task downstream of this one.
        dest_bucket_key='{{ ds }}/sales.csv',
    )
```
## 5. Monitor and Alert
- **Metadata Capture** – Log source timestamps, record counts, and hash checksums.
- **Health Checks** – Use Prometheus metrics for pipeline uptime and error rates.
- **Alerting** – Trigger Slack/Email alerts when deviations exceed thresholds (e.g., >10% drop in record count).
> **Pro tip**: Store the pipeline health status in a small, queryable table (`pipeline_status`) so business users can check it without engineering help.
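The record-count alert described above reduces to one comparison; the threshold and counts below are illustrative:

```python
def record_count_alert(today: int, baseline: float, threshold: float = 0.10) -> bool:
    """Alert when today's record count dropped by more than `threshold`
    relative to a baseline such as the trailing 7-day average."""
    drop = (baseline - today) / baseline
    return drop > threshold

print(record_count_alert(8_500, 10_000))  # True: a 15% drop
print(record_count_alert(9_500, 10_000))  # False: only a 5% drop
```

The interesting design choice is the baseline: a trailing average absorbs day-of-week noise, while comparing against yesterday alone tends to page people on Mondays.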
## 6. Validate and Iterate
1. **Sample Audits** – Spot‑check a subset of records against the source.
2. **Data Quality Scores** – Compute completeness, uniqueness, and consistency metrics automatically.
3. **Feedback Loop** – When a downstream model flags data anomalies, route that back to the ingestion team to refine the contract.
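The completeness and uniqueness scores in step 2 can be computed with a short stdlib-only sketch (the field names are illustrative):

```python
def quality_scores(records, key_field):
    """Completeness: non-null values / total values.
    Uniqueness: distinct key values / total records."""
    total_values = sum(len(r) for r in records)
    non_null = sum(1 for r in records for v in r.values() if v is not None)
    keys = [r.get(key_field) for r in records]
    return {
        "completeness": non_null / total_values,
        "uniqueness": len(set(keys)) / len(records),
    }

batch = [
    {"id": 1, "store": "S1"},
    {"id": 2, "store": None},   # incomplete record
    {"id": 2, "store": "S3"},   # duplicate key
]
scores = quality_scores(batch, key_field="id")
print(scores)  # completeness 5/6, uniqueness 2/3
```

Tracked over time, these scores turn "the data feels worse lately" into a trend line you can alert on.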
## 7. Documentation & Knowledge Transfer
- Keep the **Data Dictionary** up‑to‑date: field names, types, meanings.
- Record the *why* behind each transformation: business rationale, source quirks.
- Share a **Pipeline Playbook** (Markdown + diagrams) in the team's wiki.
## 8. Real‑World Scenario
> **Scenario**: A retail chain wants to predict next‑month sales per store.
>
> **Data Sources**:
> * POS systems (streaming via Kafka)
> * Weather API (daily REST calls)
> * Store‑level metadata (static CSV in S3)
>
> **Pipeline**:
> 1. Kafka stream → Spark Structured Streaming → clean → windowed aggregates
> 2. Weather API → Lambda → S3 (JSON) → Glue crawler → Athena
> 3. Store metadata → Glue job → Athena
> 4. Combine all in a **Feature Store** (Feast) for model consumption.
>
> **Result**: A daily feature refresh that powers a nightly model run.
## 9. Common Pitfalls
| Pitfall | Fix |
|---------|-----|
| Over‑engineering connectors | Start with a minimal adapter, iterate later |
| Ignoring data drift | Schedule periodic schema validation and anomaly detection |
| Relying on manual data pulls | Automate via CI/CD pipelines and orchestrators |
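The periodic schema validation suggested in the second row can start as simple as comparing field sets; the expected and observed schemas below are illustrative:

```python
def schema_drift(expected: dict, observed: dict) -> dict:
    """Report fields added, removed, or retyped relative to the expected schema.
    Schemas are plain {field_name: type_name} mappings."""
    return {
        "added": sorted(observed.keys() - expected.keys()),
        "removed": sorted(expected.keys() - observed.keys()),
        "retyped": sorted(
            f for f in expected.keys() & observed.keys()
            if expected[f] != observed[f]
        ),
    }

expected = {"order_id": "int", "amount": "float", "store": "str"}
observed = {"order_id": "int", "amount": "str", "region": "str"}
print(schema_drift(expected, observed))
```

Even this coarse check catches the most common silent failure: an upstream team renaming or retyping a column without updating the data contract.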
---
**Next**: Statistical analysis – turning clean data into insights that inform the next strategic decision.