Data Science for Decision Makers: Turning Numbers into Insight - Chapter 2
Published 2026-02-24 12:17
# Chapter 2: Foundations of Data
> **Why this chapter matters**
>
> In a data‑driven organization, the *value* of the insights you deliver hinges entirely on the *quality* of the data you start with. This chapter equips you with a systematic approach to sourcing, acquiring, cleaning, validating, and governing data so that every downstream analysis and model is built on a solid foundation.
---
## 2.1 Understanding Data Sources
| Source Type | Typical Formats | Example Use‑Case |
|-------------|----------------|------------------|
| **Transactional** | CSV, JSON, Parquet, relational tables | Customer purchases, IoT sensor logs |
| **Unstructured** | Text, images, audio, video | Customer reviews, product images |
| **External APIs** | REST, GraphQL, WebSocket | Weather feeds, social media sentiment |
| **Streaming** | Kafka, Flink, Kinesis | Real‑time fraud detection |
| **Crowdsourced** | Forms, surveys | Market research surveys |
### Key Takeaways
- **Schema awareness**: Know whether your data is *structured* or *semi‑structured* before you design pipelines.
- **Provenance tracking**: Capture the origin of each data point—source system, extraction timestamp, and any transformations applied.
- **Volume & velocity**: High‑speed streams require different storage and processing patterns than batch‑loaded files.
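Provenance tracking, as described above, can be as simple as wrapping each extracted batch with its origin metadata. A minimal sketch (the field names `source_system`, `extracted_at`, and `transformations` are illustrative, not a standard):

```python
# Sketch: attach provenance metadata to each extracted batch so that
# every downstream record can be traced back to its source and extraction time.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenancedBatch:
    records: list                  # the raw rows pulled from the source
    source_system: str             # e.g. "crm_postgres" (illustrative name)
    extracted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    transformations: list = field(default_factory=list)  # applied steps, in order

    def log_step(self, name: str) -> None:
        """Record a transformation so lineage survives the pipeline."""
        self.transformations.append(name)

batch = ProvenancedBatch(records=[{"id": 1}], source_system="crm_postgres")
batch.log_step("normalize_ids")
```

In practice this metadata would be persisted alongside the data (or in a lineage tool), not just held in memory.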
## 2.2 Acquisition Techniques
1. **Batch ETL** – Extract, Transform, Load processes that run at scheduled intervals.
2. **Real‑time Ingestion** – Event‑driven pipelines using message queues.
3. **API Pulls** – Periodic calls to external services; often used for market data.
4. **Web Scraping** – Automated extraction from public web pages; be mindful of legal and ethical constraints.
5. **Direct Uploads** – File uploads from partners or internal users.
### Best Practices
- **Idempotency**: Design ingestion jobs to safely handle duplicates.
- **Incremental Loads**: Pull only new or changed records to save bandwidth.
- **Back‑pressure Handling**: Ensure downstream systems can signal upstream when they are saturated.
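Idempotency and incremental loads often combine into one pattern: an upsert keyed on a primary key, guarded by a watermark. A minimal sketch (the `id`/`updated_at` column names and in-memory dict target are illustrative; a real pipeline would merge into a table):

```python
# Sketch of an idempotent, incremental merge: re-running the same batch
# leaves the target unchanged, and only records newer than the watermark
# are applied.

def incremental_merge(target: dict, batch: list, watermark: str) -> str:
    """Upsert batch rows into target (keyed by 'id'); return the new watermark.

    Rows at or before the watermark are skipped, so replays are safe.
    """
    new_watermark = watermark
    for row in batch:
        if row["updated_at"] <= watermark:
            continue                      # already loaded -> idempotent replay
        target[row["id"]] = row           # upsert: insert or overwrite
        new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark

target = {}
batch = [{"id": 1, "updated_at": "2026-01-02"},
         {"id": 2, "updated_at": "2026-01-03"}]
wm = incremental_merge(target, batch, "2026-01-01")
wm = incremental_merge(target, batch, wm)   # replay: no duplicates, same state
```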
## 2.3 Data Quality Dimensions
| Dimension | Definition | Typical Checks |
|-----------|------------|----------------|
| **Accuracy** | How close data is to the truth | Cross‑validation against a master source |
| **Completeness** | Required fields are present and populated | Null‑count reports |
| **Consistency** | Uniformity across systems | Schema reconciliation, foreign‑key integrity |
| **Timeliness** | Freshness of data | Lag metrics, timestamp verification |
| **Validity** | Adherence to business rules | Range checks, regex patterns |
| **Uniqueness** | No duplicate records | Deduplication logic |
### Data Quality Checklist
```python
# Sample Python snippet for a quality-check pipeline:
# per-column missing, invalid, and duplicate percentages.
import pandas as pd

def quality_report(df: pd.DataFrame, is_valid=lambda x: True) -> pd.DataFrame:
    """Build a per-column data quality report.

    `is_valid` is a caller-supplied predicate for the Validity dimension;
    the default accepts every value.
    """
    rows = []
    for col in df.columns:
        rows.append({
            'Column': col,
            'Missing %': df[col].isna().mean() * 100,
            'Invalid %': df[col].apply(lambda x: not is_valid(x)).mean() * 100,
            'Duplicates %': df[col].duplicated().mean() * 100,
        })
    return pd.DataFrame(rows)
```
## 2.4 Data Cleaning & Validation
| Task | Tools | Typical Use |
|------|-------|-------------|
| **Missing‑Value Imputation** | pandas, sklearn.impute | Mean, median, KNN, MICE |
| **Outlier Detection** | IsolationForest, DBSCAN | Remove or flag anomalies |
| **Type Normalization** | type conversions, schema enforcement | Ensure columns have correct datatypes |
| **Deduplication** | hash joins, window functions | Remove repeated rows |
| **Standardization** | regex, mapping tables | Normalize phone numbers, addresses |
| **Validation Rules** | dbt tests, Great Expectations | Business logic enforcement |
### Example: Address Standardization
```python
import re

def standardize_address(addr: str) -> str:
    # Basic example: collapse internal whitespace, trim, uppercase
    return re.sub(r'\s+', ' ', addr).strip().upper()
```
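For outlier detection, the table above lists IsolationForest and DBSCAN for production use; the underlying idea can be shown with the simpler IQR rule using only the standard library (the `k = 1.5` multiplier is the conventional choice, not a requirement):

```python
# A minimal outlier flag using the interquartile-range (IQR) rule:
# values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged as anomalies.
import statistics

def iqr_outliers(values, k: float = 1.5):
    """Return a per-value boolean list; True marks a suspected outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v < lo or v > hi for v in values]

flags = iqr_outliers([10, 11, 12, 11, 10, 95])   # 95 is the anomaly
```

Whether to remove or merely flag the outliers is a business decision; flagging preserves the evidence for later review.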
## 2.5 Establishing a Data Governance Framework
| Governance Pillar | Responsibility | Tools/Processes |
|-------------------|----------------|-----------------|
| **Data Catalog** | Data Stewards | Amundsen, DataHub |
| **Data Lineage** | Data Engineers | Apache Atlas, OpenLineage |
| **Security & Privacy** | Security Officers | RBAC, GDPR compliance checks |
| **Data Quality** | Data Quality Manager | Great Expectations, Datafold |
| **Metadata Management** | Metadata Engineer | Delta Lake, Iceberg |
| **Policy Enforcement** | Compliance Officer | Data masking, encryption policies |
### Governance Cadence
1. **Policy Definition** – Write clear rules for data collection, access, and retention.
2. **Implementation** – Embed policies into data pipelines (e.g., schema enforcement in Spark).
3. **Monitoring** – Run automated quality and lineage checks on a nightly basis.
4. **Reporting** – Publish dashboards (e.g., Power BI, Looker) that show data quality scores.
5. **Review & Iterate** – Quarterly audit of policies against business changes.
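Step 2 of the cadence, embedding policy into the pipeline itself, can be sketched as a schema gate that rejects non-conforming rows before they reach storage (the `SCHEMA` dict and field names below are illustrative; in production this role is played by Spark schema enforcement, dbt tests, or Great Expectations suites):

```python
# Sketch: a declared schema enforced inside the pipeline, so policy
# violations are caught at ingestion time rather than in reports.
SCHEMA = {"customer_id": int, "email": str, "opt_in": bool}

def enforce_schema(row: dict) -> bool:
    """Return True only if the row has every required field with the right type."""
    return all(
        key in row and isinstance(row[key], expected)
        for key, expected in SCHEMA.items()
    )

ok  = enforce_schema({"customer_id": 7, "email": "a@b.com", "opt_in": True})
bad = enforce_schema({"customer_id": "7", "email": "a@b.com"})  # wrong type, missing field
```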
## 2.6 Case Study: Building a Customer 360 View
> **Scenario**: A retail chain wants a unified view of each customer across online, in‑store, and mobile channels.
>
> **Challenges**:
> * Data resides in multiple siloed systems (CRM, POS, mobile app logs).
> * Inconsistent customer identifiers.
> * Varying data quality (missing emails, phone number formats).
>
> **Solution Steps**:
> 1. **Data Acquisition**: Pull data nightly via ETL and real‑time Kafka streams.
> 2. **Schema Harmonization**: Map fields to a common ontology.
> 3. **Identity Resolution**: Use deterministic (email) and probabilistic (fuzzy matching) techniques.
> 4. **Data Cleaning**: Impute missing demographics, deduplicate records.
> 5. **Governance**: Implement a data catalog, assign stewards, enforce GDPR consent flags.
> 6. **Validation**: Run Great Expectations tests to confirm field ranges and uniqueness.
> 7. **Result**: A 1‑to‑1 mapping for 98% of customers, enabling personalized marketing and accurate churn prediction.
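The identity-resolution step above (deterministic first, probabilistic as fallback) can be sketched with the standard library's `difflib`; the 0.85 similarity threshold is an illustrative choice that would be tuned against labeled match pairs:

```python
# Sketch of identity resolution: exact email match is the deterministic rule;
# fuzzy name similarity is the probabilistic fallback.
from difflib import SequenceMatcher

def same_customer(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Deterministic email match, falling back to fuzzy name matching."""
    if a.get("email") and a.get("email") == b.get("email"):
        return True                                   # deterministic rule
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold                         # probabilistic rule

m1 = same_customer({"email": "jo@x.com", "name": "Jo Smith"},
                   {"email": "jo@x.com", "name": "J. Smith"})
m2 = same_customer({"email": None, "name": "Jon Smith"},
                   {"email": "", "name": "John Smith"})
```

Production systems typically score several fields (name, address, phone) and combine them, but the two-tier structure is the same.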
---
## 2.7 Practical Checklist for Your Next Project
| Step | What to Do | Why It Matters |
|------|------------|----------------|
| **Map Data Sources** | Identify all upstream systems and their schemas. | Prevent integration surprises. |
| **Define Data Quality KPIs** | E.g., 95% completeness, <1% duplicate rate. | Quantifies expectations. |
| **Set Up Validation Rules** | Write tests in Great Expectations or dbt. | Automates early error detection. |
| **Implement Governance** | Assign data stewards, create a catalog. | Ensures accountability. |
| **Document Lineage** | Capture extraction times, transformation logic. | Critical for debugging and compliance. |
| **Monitor & Alert** | Nightly dashboards + automated alerts on KPI deviations. | Enables rapid response. |
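The "Monitor & Alert" step can turn the example KPIs above (95% completeness, <1% duplicate rate) into an automated gate; a minimal sketch, where the metric names are illustrative:

```python
# Sketch: checking a batch's quality metrics against the KPI thresholds
# defined in the checklist; an empty result means the batch passes.

def kpi_gate(metrics: dict) -> list:
    """Return the list of violated KPIs for alerting."""
    violations = []
    if metrics["completeness_pct"] < 95.0:
        violations.append("completeness below 95%")
    if metrics["duplicate_pct"] >= 1.0:
        violations.append("duplicate rate at or above 1%")
    return violations

alerts = kpi_gate({"completeness_pct": 97.2, "duplicate_pct": 2.4})
```

Wiring `alerts` into a pager or dashboard closes the loop between KPI definition and rapid response.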
> **Action Point**: For your upcoming project, create a *Data Quality Playbook* that lists all data sources, the quality dimensions you’ll monitor, the tools you’ll use, and the roles responsible for each stage.
---
## 2.8 Further Reading & Resources
- *Designing Data-Intensive Applications* – Martin Kleppmann
- *The Data Warehouse Toolkit* – Ralph Kimball & Margy Ross
- *Data Quality: The Accuracy Dimension* – Jack E. Olson
- *Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program* – John Ladley
---
> **Key Insight**: A robust foundation is not a one‑time setup; it’s a continuous cycle of acquisition, cleaning, validation, and governance that evolves with your organization’s data ecosystem.