Data Science Unveiled: From Raw Data to Insightful Decisions - Chapter 2
Published 2026-03-06 20:12
## Chapter 2: Foundations of Data Acquisition
Data science begins at the point where raw information first enters your systems. This chapter walks you through the very first step of the pipeline – gathering the data that will eventually fuel models, dashboards, and decisions.
---
### 2.1 Why Acquisition Matters
You could spend months building the most sophisticated model, only to discover that the data you fed it was noisy, incomplete, or biased. Acquisition is the gatekeeper; the quality, volume, and variety of data you start with largely dictate what you can achieve downstream.
- **Speed** – Fast acquisition means faster experimentation.
- **Scale** – Huge datasets unlock deep insights, but they also demand robust infrastructure.
- **Freshness** – In many domains (e.g., finance, IoT) the data must be close to real‑time.
- **Governance** – Proper acquisition embeds privacy, compliance, and provenance into the data from the start.
A disciplined acquisition strategy reduces friction later and ensures that the downstream steps—cleaning, transformation, modeling—are built on a solid foundation.
---
### 2.2 Types of Data Sources
| Source | Typical Format | Example Use‑Case |
|--------|----------------|-----------------|
| **Databases** | Structured (SQL/NoSQL) | Customer records, transactional logs |
| **APIs** | JSON, XML, Protobuf | Social media feeds, weather stations |
| **Files** | CSV, Parquet, Avro | Log archives, legacy exports |
| **Sensors** | Binary, streaming | IoT devices, manufacturing equipment |
| **Web** | HTML, PDFs | Scraping product listings, news articles |
| **Third‑Party Data** | Proprietary formats | Market research, credit scores |
Recognizing the format and velocity of your data source is the first decision that shapes the entire acquisition pipeline.
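One way to make that decision explicit in code is a dispatch table that maps a source's format to the reader that handles it. The sketch below sticks to the standard library for illustration; `read_csv`, `read_json`, and the `READERS` table are hypothetical names, and a real pipeline would typically reach for pandas or pyarrow instead.

```python
import csv
import io
import json

# Hypothetical dispatch table: map a file extension to a reader function.
def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def read_json(text):
    return json.loads(text)

READERS = {'.csv': read_csv, '.json': read_json}

def load(path, text):
    """Pick a reader by extension; fail loudly on unknown formats."""
    ext = path[path.rfind('.'):].lower()
    try:
        return READERS[ext](text)
    except KeyError:
        raise ValueError(f'unsupported source format: {ext}')

rows = load('export.csv', 'id,price\n1,9.99\n2,4.50\n')
print(rows[0]['price'])  # -> 9.99
```

Failing loudly on an unrecognized format is deliberate: silently skipping a source is exactly the kind of gap that is hard to detect downstream.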
---
### 2.3 Acquisition Techniques
#### 2.3.1 Pull vs. Push
- **Pull** – You request data on demand (e.g., REST API calls). Useful for small, infrequent datasets.
- **Push** – Data is sent to you automatically (e.g., Kafka, Webhooks). Ideal for high‑volume, real‑time streams.
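The two models can be sketched with in-memory stand-ins: in the pull model the consumer asks when it wants data, while in the push model the consumer registers a handler and the source drives delivery. The classes below are illustrative only; real systems would sit behind an HTTP client or a Kafka consumer.

```python
from collections import deque

class PullSource:
    """Consumer-initiated: we decide when to ask for data."""
    def __init__(self, records):
        self._records = deque(records)

    def poll(self, max_records=10):
        batch = []
        while self._records and len(batch) < max_records:
            batch.append(self._records.popleft())
        return batch

class PushSource:
    """Source-initiated: registered handlers are called as records arrive."""
    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def emit(self, record):
        for handler in self._handlers:
            handler(record)

# Pull: the consumer controls the cadence.
src = PullSource([{'id': 1}, {'id': 2}])
print(src.poll())  # -> [{'id': 1}, {'id': 2}]

# Push: records show up whenever the source produces them.
received = []
hub = PushSource()
hub.subscribe(received.append)
hub.emit({'id': 3})
print(received)  # -> [{'id': 3}]
```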
#### 2.3.2 Batch Extraction
Batch jobs (cron, Airflow DAGs) pull large volumes at scheduled intervals. They are straightforward to implement but introduce latency.
#### 2.3.3 Streaming Ingestion
Techniques like Apache Kafka, Amazon Kinesis, or Azure Event Hubs ingest data in near real‑time. They require a continuous pipeline and stateful processing but keep the data fresh.
#### 2.3.4 Web Scraping
For data that lives only on the web, libraries such as BeautifulSoup, Scrapy, or Selenium can automate the collection. Respect `robots.txt` and rate limits to stay compliant.
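The standard library's `urllib.robotparser` makes the `robots.txt` check easy to automate before any page is fetched. In this sketch the file content is inlined so the example runs offline; in practice you would load it from the site's `/robots.txt` URL, and `my-scraper` is a placeholder user-agent string.

```python
from urllib.robotparser import RobotFileParser

# Inlined robots.txt content for an offline illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check each URL before scraping it.
print(parser.can_fetch('my-scraper', 'https://example.com/products'))   # -> True
print(parser.can_fetch('my-scraper', 'https://example.com/private/x'))  # -> False
```

Pairing this check with a delay between requests (honoring any `Crawl-delay` directive) covers both halves of the compliance advice above.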
#### 2.3.5 API Wrappers
Wrap a third‑party API in a reusable library. Include error handling, retries, and pagination logic. This encapsulation protects downstream processes from API changes.
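A minimal wrapper skeleton might look like the following. The HTTP call is injected as a plain callable so the retry and pagination logic can be exercised without a live API; the `results`/`next_page` response shape is an assumption, not any real provider's contract.

```python
import time

class ApiWrapper:
    def __init__(self, fetch_page, max_retries=3, base_delay=0.01):
        self._fetch_page = fetch_page      # callable: page_number -> dict
        self._max_retries = max_retries
        self._base_delay = base_delay

    def _fetch_with_retry(self, page):
        for attempt in range(self._max_retries):
            try:
                return self._fetch_page(page)
            except ConnectionError:
                if attempt == self._max_retries - 1:
                    raise                  # persistent failure: surface it
                time.sleep(self._base_delay * 2 ** attempt)  # exponential back-off

    def fetch_all(self):
        """Follow pagination until the API reports no next page."""
        page, records = 1, []
        while page is not None:
            body = self._fetch_with_retry(page)
            records.extend(body['results'])
            page = body.get('next_page')   # None terminates the loop
        return records

# Fake API: fails once with a transient error, then serves two pages.
calls = {'n': 0}
def flaky_fetch(page):
    calls['n'] += 1
    if calls['n'] == 1:
        raise ConnectionError('transient network error')
    pages = {1: {'results': ['a', 'b'], 'next_page': 2},
             2: {'results': ['c'], 'next_page': None}}
    return pages[page]

result = ApiWrapper(flaky_fetch).fetch_all()
print(result)  # -> ['a', 'b', 'c']
```

Because downstream code only ever calls `fetch_all()`, a change in the provider's pagination scheme stays contained inside the wrapper.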
---
### 2.4 Data Quality and Governance at the Source
Acquisition is not just about pulling data; it’s also about *capturing* it correctly.
1. **Schema Validation** – Ensure the data conforms to expected types and ranges before it enters your pipeline.
2. **Metadata Capture** – Store provenance (source, timestamp, extractor version) with every record.
3. **Compliance Checks** – Apply GDPR, CCPA, or industry‑specific rules during extraction (e.g., data masking).
4. **Rate Limiting & Back‑Off** – Avoid throttling by obeying API limits and implementing exponential back‑off.
5. **Monitoring & Alerting** – Track ingestion failures, latency, and data drift in real time.
A robust acquisition layer should flag anomalies early, preventing a cascade of errors downstream.
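The first two checks above can be sketched as a hand-rolled validator that rejects malformed records and stamps provenance onto the ones that pass. Production pipelines would use Great Expectations or a schema registry instead; the field names and ranges here are purely illustrative.

```python
import time

# Illustrative schema: each field maps to (expected type, range check).
SCHEMA = {
    'user_id': (int, lambda v: v > 0),
    'amount':  (float, lambda v: 0 <= v <= 1_000_000),
}

def validate(record):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, (ftype, check) in SCHEMA.items():
        value = record.get(field)
        if not isinstance(value, ftype):
            errors.append(f'{field}: expected {ftype.__name__}')
        elif not check(value):
            errors.append(f'{field}: out of range')
    return errors

def ingest(record, source='api.example.com'):
    errors = validate(record)
    if errors:
        raise ValueError(f'rejected record: {errors}')
    # Attach provenance metadata before the record enters the pipeline.
    return {'payload': record, 'source': source, 'ingest_ts': int(time.time())}

good = ingest({'user_id': 42, 'amount': 19.99})
print(good['payload']['user_id'])  # -> 42
```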
---
### 2.5 Building the Acquisition Pipeline
Below is a high‑level template you can adapt. It uses Python, Airflow, and Kafka, but the concepts are transferable.
```python
# acquirer.py
import json
import time

import requests
from kafka import KafkaProducer

class DataAcquirer:
    def __init__(self, api_url, api_key, kafka_topic):
        self.api_url = api_url
        self.headers = {'Authorization': f'Bearer {api_key}'}
        self.producer = KafkaProducer(
            bootstrap_servers='broker1:9092',
            value_serializer=lambda v: json.dumps(v).encode('utf-8'))
        self.topic = kafka_topic

    def fetch(self):
        response = requests.get(self.api_url, headers=self.headers, timeout=30)
        response.raise_for_status()
        return response.json()

    def stream_to_kafka(self, data):
        for record in data:
            # Wrap each record with minimal provenance metadata.
            enriched = {
                'payload': record,
                'ingest_ts': int(time.time()),
                'source': self.api_url
            }
            self.producer.send(self.topic, enriched)
        self.producer.flush()


# airflow_dag.py
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from acquirer import DataAcquirer

def run_acquisition(**context):
    acquirer = DataAcquirer(
        api_url='https://api.example.com/v1/data',
        api_key='YOUR_KEY',
        kafka_topic='raw_data_topic')
    raw = acquirer.fetch()
    acquirer.stream_to_kafka(raw)

with DAG('daily_data_acquisition',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:  # avoid replaying every missed interval
    acquisition_task = PythonOperator(
        task_id='acquire_data',
        python_callable=run_acquisition)
```
**Key Takeaways**
- Use a *producer* to decouple the fetch logic from downstream consumers.
- Store minimal but essential metadata with each record.
- Keep the acquisition script idempotent to avoid duplicate ingestion.
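The idempotency takeaway can be sketched as a sink that derives a deterministic key from each record's content and silently drops keys it has already seen. `IdempotentSink` is a hypothetical name, and a real deployment would persist the seen-set across runs (for example in a key-value store) rather than keeping it in memory.

```python
import hashlib
import json

class IdempotentSink:
    def __init__(self):
        self._seen = set()
        self.accepted = []

    @staticmethod
    def record_key(record):
        # Canonical JSON (sorted keys) so the same content hashes identically.
        canonical = json.dumps(record, sort_keys=True).encode('utf-8')
        return hashlib.sha256(canonical).hexdigest()

    def write(self, record):
        key = self.record_key(record)
        if key in self._seen:
            return False          # duplicate delivery: drop it
        self._seen.add(key)
        self.accepted.append(record)
        return True

sink = IdempotentSink()
batch = [{'id': 1}, {'id': 2}, {'id': 1}]   # last record is re-delivered
results = [sink.write(r) for r in batch]
print(results)             # -> [True, True, False]
print(len(sink.accepted))  # -> 2
```

With this in place, re-running a failed acquisition job simply re-delivers records that the sink already holds, rather than duplicating them downstream.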
---
### 2.6 Tooling Landscape
| Tool | Purpose | Language | Notes |
|------|---------|----------|-------|
| **Airflow** | Orchestrate batch jobs | Python | Excellent for complex DAGs |
| **Prefect** | Workflow management with a focus on observability | Python | Modern API, integrates well with cloud services |
| **Kafka** | Streaming platform | Java/Scala | Durable, low‑latency ingestion |
| **Debezium** | Change Data Capture (CDC) from databases | Java | Turns database changes into Kafka events |
| **Scrapy** | Web scraping | Python | Built‑in concurrency |
| **PySpark** | Process big batch data | Python/Scala | Handles distributed data at scale |
| **Great Expectations** | Data validation & testing | Python | Declarative expectation suites |
Choose the right mix based on volume, latency, and team expertise.
---
### 2.7 Real‑World Case Studies
#### 2.7.1 Retail Price Optimization
A large retailer needed to monitor competitor prices in real time. Using a combination of **web scraping** (Scrapy) and **API pulls** (price feeds), they built a Kafka pipeline that ingested over one million records per hour. The ingestion layer performed schema validation with Great Expectations, tagging any anomalies for manual review.
#### 2.7.2 Smart Manufacturing
An automotive plant used **IoT sensors** streaming to Azure Event Hubs. A custom Python consumer parsed binary telemetry and stored it in Azure Data Lake. Metadata captured the device ID, timestamp, and firmware version, enabling traceability for compliance audits.
#### 2.7.3 Financial Fraud Detection
A bank aggregated transaction logs from multiple legacy databases using **Debezium CDC** into Kafka topics. The ingestion pipeline enriched each record with geolocation data from a third‑party API and forwarded the data to a Spark Structured Streaming job that flagged anomalies in real time.
---
### 2.8 Common Pitfalls & How to Avoid Them
| Pitfall | Symptom | Mitigation |
|----------|---------|------------|
| **Inadequate Schema Evolution** | Data format changes break downstream jobs | Implement schema registry (e.g., Confluent Schema Registry) and enforce strict versioning |
| **Ignoring Latency** | Real‑time signals arrive late, diminishing value | Use streaming ingestion and monitor latency dashboards |
| **Poor Error Handling** | Failed pulls cause silent data gaps | Add retries with exponential back‑off and alert on persistent failures |
| **Over‑Fetching** | Unnecessary API calls consume limits | Cache responses, respect pagination, and implement throttling |
| **Data Silos** | Data resides in disparate systems with no unified metadata | Adopt a central metadata store (e.g., Amundsen) and unify ingestion points |
---
### 2.9 Summary
- Acquisition is the **gateway** to all downstream analytics.
- Understand your source landscape and choose the right *pull* or *push* strategy.
- Embed quality checks, governance, and provenance at the source.
- Build modular, idempotent pipelines that can scale with data volume.
- Leverage the rich ecosystem of tools—Airflow, Kafka, Debezium, Scrapy, etc.—to suit your specific needs.
In the next chapter, we’ll transition from raw, noisy streams to the **cleaning** phase, turning imperfect data into reliable assets ready for exploration.