Data Science for Strategic Decision-Making: A Practical Guide – Chapter 2
Published 2026-03-03 18:00
# Chapter 2: Data Acquisition – From Need to Pipeline
In this chapter we move from the abstract “what” of the problem to the concrete “how” of collecting the data that will fuel our models. A robust acquisition strategy is the foundation on which the entire data‑science lifecycle is built.
## 1. Map the Data Landscape
1. **Identify the Source Landscape** – List all potential data stores (CRM, ERP, IoT, third‑party APIs, web services). Create a *Data Source Inventory* table:
| Source | Type | Owner | Frequency | Access Method |
|--------|------|-------|-----------|---------------|
2. **Define Data Quality Expectations** – For each source, note
* Latency tolerance
* Accuracy/Completeness metrics
* Schema stability
3. **Set Governance Boundaries** – Who can read/write? What privacy or compliance rules apply?
> **Tip**: Keep the inventory version‑controlled (e.g., in a lightweight JSON repo). It becomes the living contract between data owners and the analytics team.
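As a sketch of that version-controlled inventory, one entry might look like the following. The source name, fields, and values are illustrative, not a prescribed schema:

```python
import json

# A hypothetical Data Source Inventory entry, stored as JSON and
# version-controlled alongside the pipeline code.
entry = json.loads("""
{
  "source": "crm_contacts",
  "type": "database",
  "owner": "sales-ops",
  "frequency": "daily",
  "access_method": "JDBC",
  "latency_tolerance_hours": 24,
  "schema_stable": true
}
""")

print(entry["source"], entry["frequency"])
```

Because the file is plain JSON under version control, a change to `owner` or `frequency` shows up in a diff and can be reviewed like any other code change.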
## 2. Formalize Data Contracts
A *Data Contract* is a lightweight API‑style specification that declares:
| Element | Description |
|---------|-------------|
| Endpoint | URL or file path |
| Schema | JSON Schema / Avro |
| Frequency | Batch, stream, on‑demand |
| SLA | Latency, MTTR |
| Validation Rules | Checksums, value ranges |
Use tools like **Apigee** or **OpenAPI** for REST contracts, or **Confluent Schema Registry** for Kafka topics.
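As a minimal, stdlib-only sketch of how the validation rules in such a contract might be enforced at ingestion time (the contract fields and limits here are illustrative, not a real Apigee or Schema Registry API):

```python
# Hypothetical data contract: required fields plus simple value-range rules.
CONTRACT = {
    "required_fields": {"order_id", "store_id", "amount"},
    "value_ranges": {"amount": (0.0, 100_000.0)},
}

def validate(record: dict, contract: dict) -> list:
    """Return a list of contract violations for one record."""
    errors = []
    missing = contract["required_fields"] - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    for field, (lo, hi) in contract["value_ranges"].items():
        if field in record and not (lo <= record[field] <= hi):
            errors.append(f"{field}={record[field]} outside [{lo}, {hi}]")
    return errors

good = {"order_id": 1, "store_id": "S1", "amount": 42.5}
bad = {"order_id": 2, "amount": -5.0}
print(validate(good, CONTRACT))  # []
print(validate(bad, CONTRACT))
```

Running violations back to the data owner, rather than silently dropping records, is what makes the contract a two-way agreement.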
## 3. Choose the Right Ingestion Architecture
| Architecture | When to Use | Typical Tools |
|--------------|-------------|---------------|
| **Batch** | Periodic reporting, nightly jobs | Airflow, Prefect, dbt |
| **Streaming** | Real‑time analytics, fraud detection | Kafka, Kinesis, Flink |
| **Hybrid** | Mixed workloads | Lambda + S3 + Glue |
Design the pipeline with **de‑duplication** and **idempotency** in mind. A single source can push duplicate events; your pipeline must handle that gracefully.
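The de-duplication requirement above can be sketched with a simple event-ID check; the event shape and IDs are illustrative:

```python
def deduplicate(events, seen_ids=None):
    """Drop events whose 'event_id' was already processed, so that
    replaying the same batch is idempotent."""
    seen_ids = set() if seen_ids is None else seen_ids
    unique = []
    for event in events:
        if event["event_id"] not in seen_ids:
            seen_ids.add(event["event_id"])
            unique.append(event)
    return unique

batch = [
    {"event_id": "a1", "amount": 10},
    {"event_id": "a1", "amount": 10},  # duplicate push from the source
    {"event_id": "b2", "amount": 7},
]
print(len(deduplicate(batch)))  # 2
```

In production the `seen_ids` set would live in durable storage (or the sink would enforce a unique key), but the contract is the same: processing the same input twice must not produce duplicate output.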
## 4. Build the Data Pipeline
1. **Extract** – Use connectors (e.g., JDBC for SQL, S3 for files, SDKs for APIs). Prefer *pull* mechanisms when the data provider is stable; use *push* for high‑velocity streams.
2. **Transform** – Apply ETL or ELT steps:
* **Schema Alignment** – Map legacy fields to new names.
* **Data Cleansing** – Null handling, outlier filtering.
* **Feature Engineering** – Compute rolling aggregates, join historical snapshots.
3. **Load** – Store in a *Data Lake* (S3, ADLS) or a *Data Warehouse* (Snowflake, BigQuery). Use partitioning by date or business key to accelerate downstream queries.
```python
# Sample Airflow DAG: copy the raw daily sales file into the processed bucket.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3CopyObjectOperator

with DAG(
    'daily_sales_ingest',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    copy_task = S3CopyObjectOperator(
        task_id='copy_raw_sales',
        source_bucket_name='raw-sales',
        dest_bucket_name='processed-sales',
        source_bucket_key='{{ ds }}/sales.csv',
        # An S3 copy moves bytes unchanged; converting CSV to Parquet
        # would be a separate transform task downstream of this one.
        dest_bucket_key='{{ ds }}/sales.csv',
    )
```
## 5. Monitor and Alert
- **Metadata Capture** – Log source timestamps, record counts, and hash checksums.
- **Health Checks** – Use Prometheus metrics for pipeline uptime and error rates.
- **Alerting** – Trigger Slack/Email alerts when deviations exceed thresholds (e.g., >10% drop in record count).
> **Pro tip**: Store the pipeline health status in a small, queryable table (`pipeline_status`) so business users can check it without engineering help.
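The record-count alert described above reduces to one comparison; the threshold and counts below are illustrative:

```python
def record_count_alert(today: int, baseline: float, threshold: float = 0.10) -> bool:
    """Alert when today's record count dropped by more than `threshold`
    relative to a baseline such as the trailing 7-day average."""
    drop = (baseline - today) / baseline
    return drop > threshold

print(record_count_alert(8_500, 10_000))  # True: a 15% drop
print(record_count_alert(9_500, 10_000))  # False: only a 5% drop
```

The interesting design choice is the baseline: a trailing average absorbs day-of-week noise, while comparing against yesterday alone tends to page people on Mondays.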
## 6. Validate and Iterate
1. **Sample Audits** – Spot‑check a subset of records against the source.
2. **Data Quality Scores** – Compute completeness, uniqueness, and consistency metrics automatically.
3. **Feedback Loop** – When a downstream model flags data anomalies, route that back to the ingestion team to refine the contract.
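The completeness and uniqueness scores in step 2 can be computed with a short stdlib-only sketch (the field names are illustrative):

```python
def quality_scores(records, key_field):
    """Completeness: non-null values / total values.
    Uniqueness: distinct key values / total records."""
    total_values = sum(len(r) for r in records)
    non_null = sum(1 for r in records for v in r.values() if v is not None)
    keys = [r.get(key_field) for r in records]
    return {
        "completeness": non_null / total_values,
        "uniqueness": len(set(keys)) / len(records),
    }

batch = [
    {"id": 1, "store": "S1"},
    {"id": 2, "store": None},   # incomplete record
    {"id": 2, "store": "S3"},   # duplicate key
]
scores = quality_scores(batch, key_field="id")
print(scores)  # completeness 5/6, uniqueness 2/3
```

Tracked over time, these scores turn "the data feels worse lately" into a trend line you can alert on.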
## 7. Documentation & Knowledge Transfer
- Keep the **Data Dictionary** up‑to‑date: field names, types, meanings.
- Record the *why* behind each transformation: business rationale, source quirks.
- Share a **Pipeline Playbook** (Markdown + diagrams) in the team's wiki.
## 8. Real‑World Scenario
> **Scenario**: A retail chain wants to predict next‑month sales per store.
>
> **Data Sources**:
> * POS systems (streaming via Kafka)
> * Weather API (daily REST calls)
> * Store‑level metadata (static CSV in S3)
>
> **Pipeline**:
> 1. Kafka stream → Spark Structured Streaming → clean → windowed aggregates
> 2. Weather API → Lambda → S3 (JSON) → Glue crawler → Athena
> 3. Store metadata → Glue job → Athena
> 4. Combine all in a **Feature Store** (Feast) for model consumption.
>
> **Result**: A daily feature refresh that powers a nightly model run.
## 9. Common Pitfalls
| Pitfall | Fix |
|---------|-----|
| Over‑engineering connectors | Start with a minimal adapter, iterate later |
| Ignoring data drift | Schedule periodic schema validation and anomaly detection |
| Relying on manual data pulls | Automate via CI/CD pipelines and orchestrators |
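The periodic schema validation suggested in the second row can start as simple as comparing field sets; the expected and observed schemas below are illustrative:

```python
def schema_drift(expected: dict, observed: dict) -> dict:
    """Report fields added, removed, or retyped relative to the expected schema.
    Schemas are plain {field_name: type_name} mappings."""
    return {
        "added": sorted(observed.keys() - expected.keys()),
        "removed": sorted(expected.keys() - observed.keys()),
        "retyped": sorted(
            f for f in expected.keys() & observed.keys()
            if expected[f] != observed[f]
        ),
    }

expected = {"order_id": "int", "amount": "float", "store": "str"}
observed = {"order_id": "int", "amount": "str", "region": "str"}
print(schema_drift(expected, observed))
```

Even this coarse check catches the most common silent failure: an upstream team renaming or retyping a column without updating the data contract.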
---
**Next**: Statistical analysis – turning clean data into insights that inform the next strategic decision.