Analytics Alchemy: Turning Data into Strategic Advantage - Chapter 3
Published 2026-03-02 15:27
# Chapter 3: Building Reliable Data Pipelines
Data pipelines are the backbone of any analytics operation. They translate raw, heterogeneous sources into curated, query‑ready assets that power insights, reports, and models. In this chapter we walk through the full lifecycle of a pipeline—**acquisition → cleaning → transformation → storage**—and discuss the architectural choices that scale with business growth.
## 3.1 Data Acquisition
Acquiring data is often the most complex step because it involves multiple source systems, varying formats, and unpredictable change rates.
| Source | Typical Format | Integration Method | Typical Latency |
|--------|----------------|--------------------|-----------------|
| SQL DB | Structured | JDBC/ODBC | Minutes to hours |
| REST API | JSON | HTTP client | Seconds to minutes |
| IoT Sensor | CSV/NDJSON | Kafka producer | Milliseconds |
| SaaS Platform | CSV/Excel | ETL connector | Hours |
### Best Practices
1. **Metadata‑driven connectors** – Store source schema, keys, and update frequency in a catalog.
2. **Incremental pulls** – Use timestamps or change‑data‑capture (CDC) to minimize data volume.
3. **Self‑service ingestion** – Build a UI for non‑technical users to define sources; this reduces engineering friction.
4. **Error monitoring** – Emit alerts on failed pulls and back‑fill on recovery.
#### Sample Python Incremental Pull

```python
import requests
import pandas as pd
from datetime import datetime, timedelta

# Pull only records changed since the last run (here: the past 24 hours)
last_run = datetime.utcnow() - timedelta(days=1)
url = f"https://api.example.com/orders?since={last_run.isoformat()}"
response = requests.get(url, headers={'Authorization': 'Bearer TOKEN'}, timeout=30)
response.raise_for_status()  # fail loudly so a failed pull can be alerted on
orders = pd.json_normalize(response.json()['data'])

# Persist to staging
orders.to_parquet('/staging/orders.parquet')
```
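Best practice 4 above (error monitoring with back-fill on recovery) can be sketched as a retry wrapper around any pull; `pull_with_retry` and its `alert` hook are illustrative names, not part of any library, and in practice `alert` would post to your paging or chat system:

```python
import time

def pull_with_retry(pull, retries=3, backoff=2.0, alert=print):
    """Run a zero-argument pull callable, retrying with exponential backoff.

    `pull` raises on failure; `alert` stands in for a real alerting hook.
    """
    for attempt in range(1, retries + 1):
        try:
            return pull()
        except Exception as exc:  # illustrative catch-all
            alert(f"pull failed (attempt {attempt}/{retries}): {exc}")
            if attempt == retries:
                raise  # exhausted retries: surface the failure
            time.sleep(backoff ** attempt)
```

A scheduler can then call `pull_with_retry(run_ingest)` and rely on the raised exception to mark the task failed for later back-fill.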
## 3.2 Data Cleaning & Validation
Raw data often contains **noise, missing values, and schema drift**. Cleaning transforms it into a trustworthy dataset.
| Problem | Typical Fix | Tooling |
|---------|-------------|---------|
| Nulls | Impute / Drop | pandas, Dask |
| Outliers | Winsorize / Remove | scikit‑learn, PyOD |
| Inconsistent types | Cast & Normalize | SQL, Spark |
| Duplicate rows | Dedup by key | SQL, Pandas `drop_duplicates` |
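The fixes in the table compose naturally in pandas. A minimal sketch, using an invented two-column extract to show dedup by key, type casting, and null handling in one chain:

```python
import pandas as pd

# Illustrative raw extract exhibiting the problems from the table above
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "order_total": ["10.5", "20.0", "20.0", None],
})

clean = (
    raw
    .drop_duplicates(subset="order_id")   # dedup by key
    .assign(order_total=lambda d: pd.to_numeric(d["order_total"]))  # cast strings to numbers
    .dropna(subset=["order_total"])       # drop rows missing the amount
)
```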
### Validation Layers
1. **Schema validation** – Use JSON Schema or Apache Avro for structural checks.
2. **Constraint validation** – Enforce business rules (e.g., `order_total >= 0`).
3. **Data quality metrics** – Track completeness, accuracy, timeliness, and uniqueness.
#### Validation with Great Expectations

```python
import great_expectations as ge

# Wrap the staged file in a Great Expectations dataset
df = ge.read_parquet('/staging/orders.parquet')

# Expect no nulls in key columns
df.expect_column_values_to_not_be_null('order_id')

# Expect order_total to be strictly positive
df.expect_column_values_to_be_between('order_total', min_value=0, strict_min=True)

# Save the expectation suite for reuse in the pipeline (stored as JSON)
df.save_expectation_suite('order_validation.json')
```
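Layer 3 (data quality metrics) can also be computed directly. A minimal sketch, assuming a hypothetical helper named `quality_metrics`, covering two of the four metrics named above (completeness and uniqueness):

```python
import pandas as pd

def quality_metrics(df: pd.DataFrame, key: str) -> dict:
    """Compute completeness (share of fully populated rows) and
    uniqueness (share of distinct key values) for a dataframe."""
    total = len(df)
    if total == 0:
        return {"completeness": 1.0, "uniqueness": 1.0}
    return {
        "completeness": float(df.notna().all(axis=1).mean()),
        "uniqueness": df[key].nunique() / total,
    }
```

Tracking these numbers per run turns quality from an anecdote into a trend you can alert on.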
## 3.3 Transformation & Enrichment
Transformations turn cleaned data into analytical models, often applying **business logic** or **feature engineering**.
### Common Transformation Patterns
- **Denormalization** – Flatten relational data for analytics engines.
- **Feature Engineering** – Create lagged variables, rolling aggregates, or categorical embeddings.
- **Data Merging** – Join datasets across domains (e.g., sales + marketing).
- **Window Functions** – Compute running totals or moving averages.
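The window-function pattern above maps directly onto pandas. A small sketch with invented daily revenue figures, computing a running total and a 3-day moving average:

```python
import pandas as pd

sales = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=5, freq="D"),
    "revenue": [100, 120, 90, 110, 130],
})

# Running total over all prior days
sales["running_total"] = sales["revenue"].cumsum()

# 3-day moving average; min_periods=1 keeps the first rows defined
sales["moving_avg_3d"] = sales["revenue"].rolling(window=3, min_periods=1).mean()
```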
#### Example: Customer Lifetime Value (CLV)
```sql
WITH order_sums AS (
    SELECT customer_id,
           SUM(order_total) AS lifetime_spend,
           COUNT(*) AS order_count,
           MAX(order_date) AS last_order
    FROM prod.orders
    GROUP BY customer_id
), churn AS (
    SELECT customer_id,
           CASE WHEN last_order < CURRENT_DATE - INTERVAL '30' DAY
                THEN 1 ELSE 0 END AS churned
    FROM order_sums
)
SELECT o.customer_id,
       o.lifetime_spend,
       o.order_count,
       c.churned
FROM order_sums o
JOIN churn c USING (customer_id);
```
## 3.4 Storage & Data Warehouse Design
Choosing the right storage layer determines query performance, cost, and scalability.
| Storage Layer | Use‑Case | Typical Tools | Storage Model | Cost |
|---------------|----------|---------------|---------------|------|
| Raw Staging | Raw, untransformed | S3, GCS, Azure Blob | Object storage | Low |
| Curated Layer | Business‑ready, governed | Snowflake, BigQuery, Redshift | Columnar | Pay per query |
| Data Lake | Raw + semi‑structured | Delta Lake, Iceberg | Tiered storage | Mid |
| Operational DB | OLTP | PostgreSQL, MySQL, CockroachDB | Row‑based | High at scale |
### Architecture Patterns
- **Lambda** – Batch + streaming for real‑time + historical.
- **Kappa** – Single streaming pipeline re‑processing.
- **Monolith** – All in one data warehouse (simplicity for small orgs).
#### Schema Design Tip: Star vs. Snowflake
- **Star** – Simplifies queries but may have redundancy.
- **Snowflake** – Normalizes dimensions to reduce storage.
## 3.5 Version Control for Data
Just like code, data benefits from versioning.
| Approach | Pros | Cons |
|----------|------|------|
| **Data Version Control (DVC)** | Tracks lineage, reproducibility | Requires extra tooling |
| **Git Large File Storage (LFS)** | Simple to adopt | Not ideal for petabyte scale |
| **Time‑Series Snapshots** | Easy rollback | Can be storage‑intensive |
#### DVC Workflow

```bash
# Stage raw data
python ingest.py --output data/raw/orders.csv

# Track the file with DVC (writes a small .dvc pointer file)
dvc add data/raw/orders.csv

# Commit the pointer (not the data itself) to Git
git add data/raw/orders.csv.dvc data/raw/.gitignore
git commit -m "Add initial orders data"

# Push metadata to Git and the data to DVC remote storage
git push
dvc push
```
## 3.6 Monitoring & Alerting
Data pipelines are only useful if they are trustworthy. Continuous monitoring ensures early detection of drift or failures.
| Metric | Source | Alert Trigger |
|--------|--------|--------------|
| Data Volume | Pipeline log | > 20% deviation from baseline |
| Latency | Processing time | > 5x average |
| Data Quality | Validation results | Any validation failure |
| Schema Drift | Schema registry | Any change in mandatory fields |
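The volume trigger in the table reduces to a one-line deviation check; the function name and its 20% default are illustrative, matching the threshold above:

```python
def volume_alert(current: int, baseline: int, threshold: float = 0.20) -> bool:
    """Return True when row volume deviates more than `threshold`
    (as a fraction) from the baseline count."""
    if baseline == 0:
        return current > 0  # any data where none is expected is a deviation
    return abs(current - baseline) / baseline > threshold
```

In practice the baseline would be a rolling average of recent runs rather than a fixed constant.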
#### Example: Airflow DAG with Sensors

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.http_sensor import HttpSensor

def ingest_orders():
    """Placeholder for the ingestion logic from section 3.1."""
    ...

with DAG('orders_pipeline',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily') as dag:
    # Block until the source API reports healthy
    wait_for_api = HttpSensor(
        task_id='wait_for_api',
        http_conn_id='orders_api',
        endpoint='health',
        response_check=lambda response: response.json()['status'] == 'ok',
    )
    ingest = PythonOperator(task_id='ingest', python_callable=ingest_orders)

    wait_for_api >> ingest
```
## 3.7 Scalability Considerations
| Challenge | Solution | Trade‑Off |
|-----------|----------|-----------|
| Volume spike | Auto‑scaling clusters (EKS, Databricks) | Cost overhead |
| Complex transformations | Spark/Databricks | Learning curve |
| Multi‑region compliance | Data lake with region‑specific buckets | Latency |
| Data freshness | Incremental loading + streaming | Complexity |
### Cloud‑Native Example: Snowflake Streams
Snowflake’s *Streams* capture change data on a source table, enabling near‑real‑time incremental loads into a fact table without reprocessing history.
```sql
-- Create a stream over the raw orders table
CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders;

-- Load new data into the fact table
MERGE INTO fact_orders f
USING raw_orders_stream s
ON f.order_id = s.order_id
WHEN NOT MATCHED THEN
    INSERT (order_id, amount, ts) VALUES (s.order_id, s.amount, s.ts);
```
## 3.8 Summary
| Step | Key Takeaway |
|------|--------------|
| Acquisition | Use metadata‑driven connectors and incremental pulls. |
| Cleaning | Apply layered validation; automate with tools like Great Expectations. |
| Transformation | Build reusable, parameterised pipelines; leverage window functions. |
| Storage | Choose the right layer (raw, curated, lake) and architecture pattern. |
| Versioning | Track data lineage with DVC or similar. |
| Monitoring | Set up alerts on volume, latency, quality, and schema drift. |
| Scalability | Plan for auto‑scaling, incremental loads, and cloud native features. |
By embedding these principles into every engineering decision—grounded in measurable business impact—you’ll create pipelines that not only run reliably but also accelerate the delivery of value across the organization.