Data Science Unveiled: From Raw Data to Insightful Decisions - Chapter 2
Published 2026-03-06 20:12
## Chapter 2: Foundations of Data Acquisition
Data science begins at the point where raw information first enters your systems. This chapter walks you through the very first step of the pipeline – gathering the data that will eventually fuel models, dashboards, and decisions.
---
### 2.1 Why Acquisition Matters
You could spend months building the most sophisticated model, only to discover that the data you fed it was noisy, incomplete, or biased. Acquisition is the gatekeeper; the quality, volume, and variety of data you start with largely dictate what you can achieve downstream.
- **Speed** – Fast acquisition means faster experimentation.
- **Scale** – Huge datasets unlock deep insights, but they also demand robust infrastructure.
- **Freshness** – In many domains (e.g., finance, IoT) the data must be close to real‑time.
- **Governance** – Proper acquisition embeds privacy, compliance, and provenance into the data from the start.
A disciplined acquisition strategy reduces friction later and ensures that the downstream steps—cleaning, transformation, modeling—are built on a solid foundation.
---
### 2.2 Types of Data Sources
| Source | Typical Format | Example Use‑Case |
|--------|----------------|-----------------|
| **Databases** | Structured (SQL/NoSQL) | Customer records, transactional logs |
| **APIs** | JSON, XML, Protobuf | Social media feeds, weather stations |
| **Files** | CSV, Parquet, Avro | Log archives, legacy exports |
| **Sensors** | Binary, streaming | IoT devices, manufacturing equipment |
| **Web** | HTML, PDFs | Scraping product listings, news articles |
| **Third‑Party Data** | Proprietary formats | Market research, credit scores |
Recognizing the format and velocity of your data source is the first decision that shapes the entire acquisition pipeline.
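One way to make that decision explicit in code is a dispatch table that maps a source's format to the reader that handles it. The sketch below sticks to the standard library for illustration; `read_csv`, `read_json`, and the `READERS` table are hypothetical names, and a real pipeline would typically reach for pandas or pyarrow instead.

```python
import csv
import io
import json

# Hypothetical dispatch table: map a file extension to a reader function.
def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def read_json(text):
    return json.loads(text)

READERS = {'.csv': read_csv, '.json': read_json}

def load(path, text):
    """Pick a reader by extension; fail loudly on unknown formats."""
    ext = path[path.rfind('.'):].lower()
    try:
        return READERS[ext](text)
    except KeyError:
        raise ValueError(f'unsupported source format: {ext}')

rows = load('export.csv', 'id,price\n1,9.99\n2,4.50\n')
print(rows[0]['price'])  # -> 9.99
```

Failing loudly on an unrecognized format is deliberate: silently skipping a source is exactly the kind of gap that is hard to detect downstream.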
---
### 2.3 Acquisition Techniques
#### 2.3.1 Pull vs. Push
- **Pull** – You request data on demand (e.g., REST API calls). Useful for small, infrequent datasets.
- **Push** – Data is sent to you automatically (e.g., Kafka, Webhooks). Ideal for high‑volume, real‑time streams.
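The two models can be sketched with in-memory stand-ins: in the pull model the consumer asks when it wants data, while in the push model the consumer registers a handler and the source drives delivery. The classes below are illustrative only; real systems would sit behind an HTTP client or a Kafka consumer.

```python
from collections import deque

class PullSource:
    """Consumer-initiated: we decide when to ask for data."""
    def __init__(self, records):
        self._records = deque(records)

    def poll(self, max_records=10):
        batch = []
        while self._records and len(batch) < max_records:
            batch.append(self._records.popleft())
        return batch

class PushSource:
    """Source-initiated: registered handlers are called as records arrive."""
    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def emit(self, record):
        for handler in self._handlers:
            handler(record)

# Pull: the consumer controls the cadence.
src = PullSource([{'id': 1}, {'id': 2}])
print(src.poll())  # -> [{'id': 1}, {'id': 2}]

# Push: records show up whenever the source produces them.
received = []
hub = PushSource()
hub.subscribe(received.append)
hub.emit({'id': 3})
print(received)  # -> [{'id': 3}]
```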
#### 2.3.2 Batch Extraction
Batch jobs (cron, Airflow DAGs) pull large volumes at scheduled intervals. They are straightforward to implement but introduce latency.
#### 2.3.3 Streaming Ingestion
Techniques like Apache Kafka, Amazon Kinesis, or Azure Event Hubs ingest data in near real‑time. They require a continuous pipeline and stateful processing but keep the data fresh.
#### 2.3.4 Web Scraping
For data that lives only on the web, libraries such as BeautifulSoup, Scrapy, or Selenium can automate the collection. Respect `robots.txt` and rate limits to stay compliant.
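The standard library's `urllib.robotparser` makes the `robots.txt` check easy to automate before any page is fetched. In this sketch the file content is inlined so the example runs offline; in practice you would load it from the site's `/robots.txt` URL, and `my-scraper` is a placeholder user-agent string.

```python
from urllib.robotparser import RobotFileParser

# Inlined robots.txt content for an offline illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check each URL before scraping it.
print(parser.can_fetch('my-scraper', 'https://example.com/products'))   # -> True
print(parser.can_fetch('my-scraper', 'https://example.com/private/x'))  # -> False
```

Pairing this check with a delay between requests (honoring any `Crawl-delay` directive) covers both halves of the compliance advice above.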
#### 2.3.5 API Wrappers
Wrap a third‑party API in a reusable library. Include error handling, retries, and pagination logic. This encapsulation protects downstream processes from API changes.
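A minimal wrapper skeleton might look like the following. The HTTP call is injected as a plain callable so the retry and pagination logic can be exercised without a live API; the `results`/`next_page` response shape is an assumption, not any real provider's contract.

```python
import time

class ApiWrapper:
    def __init__(self, fetch_page, max_retries=3, base_delay=0.01):
        self._fetch_page = fetch_page      # callable: page_number -> dict
        self._max_retries = max_retries
        self._base_delay = base_delay

    def _fetch_with_retry(self, page):
        for attempt in range(self._max_retries):
            try:
                return self._fetch_page(page)
            except ConnectionError:
                if attempt == self._max_retries - 1:
                    raise                  # persistent failure: surface it
                time.sleep(self._base_delay * 2 ** attempt)  # exponential back-off

    def fetch_all(self):
        """Follow pagination until the API reports no next page."""
        page, records = 1, []
        while page is not None:
            body = self._fetch_with_retry(page)
            records.extend(body['results'])
            page = body.get('next_page')   # None terminates the loop
        return records

# Fake API: fails once with a transient error, then serves two pages.
calls = {'n': 0}
def flaky_fetch(page):
    calls['n'] += 1
    if calls['n'] == 1:
        raise ConnectionError('transient network error')
    pages = {1: {'results': ['a', 'b'], 'next_page': 2},
             2: {'results': ['c'], 'next_page': None}}
    return pages[page]

result = ApiWrapper(flaky_fetch).fetch_all()
print(result)  # -> ['a', 'b', 'c']
```

Because downstream code only ever calls `fetch_all()`, a change in the provider's pagination scheme stays contained inside the wrapper.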
---
### 2.4 Data Quality and Governance at the Source
Acquisition is not just about pulling data; it’s also about *capturing* it correctly.
1. **Schema Validation** – Ensure the data conforms to expected types and ranges before it enters your pipeline.
2. **Metadata Capture** – Store provenance (source, timestamp, extractor version) with every record.
3. **Compliance Checks** – Apply GDPR, CCPA, or industry‑specific rules during extraction (e.g., data masking).
4. **Rate Limiting & Back‑Off** – Avoid throttling by obeying API limits and implementing exponential back‑off.
5. **Monitoring & Alerting** – Track ingestion failures, latency, and data drift in real time.
A robust acquisition layer should flag anomalies early, preventing a cascade of errors downstream.
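The first two checks above can be sketched as a hand-rolled validator that rejects malformed records and stamps provenance onto the ones that pass. Production pipelines would use Great Expectations or a schema registry instead; the field names and ranges here are purely illustrative.

```python
import time

# Illustrative schema: each field maps to (expected type, range check).
SCHEMA = {
    'user_id': (int, lambda v: v > 0),
    'amount':  (float, lambda v: 0 <= v <= 1_000_000),
}

def validate(record):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, (ftype, check) in SCHEMA.items():
        value = record.get(field)
        if not isinstance(value, ftype):
            errors.append(f'{field}: expected {ftype.__name__}')
        elif not check(value):
            errors.append(f'{field}: out of range')
    return errors

def ingest(record, source='api.example.com'):
    errors = validate(record)
    if errors:
        raise ValueError(f'rejected record: {errors}')
    # Attach provenance metadata before the record enters the pipeline.
    return {'payload': record, 'source': source, 'ingest_ts': int(time.time())}

good = ingest({'user_id': 42, 'amount': 19.99})
print(good['payload']['user_id'])  # -> 42
```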
---
### 2.5 Building the Acquisition Pipeline
Below is a high‑level template you can adapt. It uses Python, Airflow, and Kafka, but the concepts are transferable.
```python
# acquirer.py
import json
import time

import requests
from kafka import KafkaProducer

class DataAcquirer:
    def __init__(self, api_url, api_key, kafka_topic):
        self.api_url = api_url
        self.headers = {'Authorization': f'Bearer {api_key}'}
        self.producer = KafkaProducer(
            bootstrap_servers='broker1:9092',
            value_serializer=lambda v: json.dumps(v).encode('utf-8'))
        self.topic = kafka_topic

    def fetch(self):
        response = requests.get(self.api_url, headers=self.headers, timeout=30)
        response.raise_for_status()
        return response.json()

    def stream_to_kafka(self, data):
        for record in data:
            # Wrap each record with minimal provenance metadata.
            enriched = {
                'payload': record,
                'ingest_ts': int(time.time()),
                'source': self.api_url
            }
            self.producer.send(self.topic, enriched)
        self.producer.flush()


# airflow_dag.py
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from acquirer import DataAcquirer

def run_acquisition(**context):
    acquirer = DataAcquirer(
        api_url='https://api.example.com/v1/data',
        api_key='YOUR_KEY',
        kafka_topic='raw_data_topic')
    raw = acquirer.fetch()
    acquirer.stream_to_kafka(raw)

with DAG('daily_data_acquisition',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:  # avoid replaying every missed interval
    acquisition_task = PythonOperator(
        task_id='acquire_data',
        python_callable=run_acquisition)
```
**Key Takeaways**
- Use a *producer* to decouple the fetch logic from downstream consumers.
- Store minimal but essential metadata with each record.
- Keep the acquisition script idempotent to avoid duplicate ingestion.
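The idempotency takeaway can be sketched as a sink that derives a deterministic key from each record's content and silently drops keys it has already seen. `IdempotentSink` is a hypothetical name, and a real deployment would persist the seen-set across runs (for example in a key-value store) rather than keeping it in memory.

```python
import hashlib
import json

class IdempotentSink:
    def __init__(self):
        self._seen = set()
        self.accepted = []

    @staticmethod
    def record_key(record):
        # Canonical JSON (sorted keys) so the same content hashes identically.
        canonical = json.dumps(record, sort_keys=True).encode('utf-8')
        return hashlib.sha256(canonical).hexdigest()

    def write(self, record):
        key = self.record_key(record)
        if key in self._seen:
            return False          # duplicate delivery: drop it
        self._seen.add(key)
        self.accepted.append(record)
        return True

sink = IdempotentSink()
batch = [{'id': 1}, {'id': 2}, {'id': 1}]   # last record is re-delivered
results = [sink.write(r) for r in batch]
print(results)             # -> [True, True, False]
print(len(sink.accepted))  # -> 2
```

With this in place, re-running a failed acquisition job simply re-delivers records that the sink already holds, rather than duplicating them downstream.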
---
### 2.6 Tooling Landscape
| Tool | Purpose | Language | Notes |
|------|---------|----------|-------|
| **Airflow** | Orchestrate batch jobs | Python | Excellent for complex DAGs |
| **Prefect** | Workflow management with a focus on observability | Python | Modern API, integrates well with cloud services |
| **Kafka** | Streaming platform | Java/Scala | Durable, low‑latency ingestion |
| **Debezium** | Change Data Capture (CDC) from databases | Java | Turns database changes into Kafka events |
| **Scrapy** | Web scraping | Python | Built‑in concurrency |
| **PySpark** | Process big batch data | Python/Scala | Handles distributed data at scale |
| **Great Expectations** | Data validation & testing | Python | Declarative expectation suites |
Choose the right mix based on volume, latency, and team expertise.
---
### 2.7 Real‑World Case Studies
#### 2.7.1 Retail Price Optimization
A large retailer needed to monitor competitor prices in real time. Using a combination of **web scraping** (Scrapy) and **API pulls** (price feeds), they built a Kafka pipeline that ingested over one million records per hour. The ingestion layer performed schema validation with Great Expectations, tagging any anomalies for manual review.
#### 2.7.2 Smart Manufacturing
An automotive plant used **IoT sensors** streaming to Azure Event Hubs. A custom Python consumer parsed binary telemetry and stored it in Azure Data Lake. Metadata captured the device ID, timestamp, and firmware version, enabling traceability for compliance audits.
#### 2.7.3 Financial Fraud Detection
A bank aggregated transaction logs from multiple legacy databases using **Debezium CDC** into Kafka topics. The ingestion pipeline enriched each record with geolocation data from a third‑party API and forwarded the data to a Spark Structured Streaming job that flagged anomalies in real time.
---
### 2.8 Common Pitfalls & How to Avoid Them
| Pitfall | Symptom | Mitigation |
|----------|---------|------------|
| **Inadequate Schema Evolution** | Data format changes break downstream jobs | Implement schema registry (e.g., Confluent Schema Registry) and enforce strict versioning |
| **Ignoring Latency** | Real‑time signals arrive late, diminishing value | Use streaming ingestion and monitor latency dashboards |
| **Poor Error Handling** | Failed pulls cause silent data gaps | Add retries with exponential back‑off and alert on persistent failures |
| **Over‑Fetching** | Unnecessary API calls consume limits | Cache responses, respect pagination, and implement throttling |
| **Data Silos** | Data resides in disparate systems with no unified metadata | Adopt a central metadata store (e.g., Amundsen) and unify ingestion points |
---
### 2.9 Summary
- Acquisition is the **gateway** to all downstream analytics.
- Understand your source landscape and choose the right *pull* or *push* strategy.
- Embed quality checks, governance, and provenance at the source.
- Build modular, idempotent pipelines that can scale with data volume.
- Leverage the rich ecosystem of tools—Airflow, Kafka, Debezium, Scrapy, etc.—to suit your specific needs.
In the next chapter, we’ll transition from raw, noisy streams to the **cleaning** phase, turning imperfect data into reliable assets ready for exploration.