Data Science for Decision Makers: Turning Numbers into Insight - Chapter 2
Published 2026-02-24 12:17
# Chapter 2: Foundations of Data
> **Why this chapter matters**
>
> In a data‑driven organization, the *value* of the insights you deliver hinges entirely on the *quality* of the data you start with. This chapter equips you with a systematic approach to sourcing, acquiring, cleaning, validating, and governing data so that every downstream analysis and model is built on a solid foundation.
---
## 2.1 Understanding Data Sources
| Source Type | Typical Formats | Example Use‑Case |
|-------------|----------------|------------------|
| **Transactional** | CSV, JSON, Parquet, relational tables | Customer purchases, IoT sensor logs |
| **Unstructured** | Text, images, audio, video | Customer reviews, product images |
| **External APIs** | REST, GraphQL, WebSocket | Weather feeds, social media sentiment |
| **Streaming** | Kafka, Flink, Kinesis | Real‑time fraud detection |
| **Crowdsourced** | Forms, surveys | Market research surveys |
### Key Takeaways
- **Schema awareness**: Know whether your data is *structured* or *semi‑structured* before you design pipelines.
- **Provenance tracking**: Capture the origin of each data point—source system, extraction timestamp, and any transformations applied.
- **Volume & velocity**: High‑speed streams require different storage and processing patterns than batch‑loaded files.
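Provenance tracking, as described above, can be as simple as wrapping each extracted batch with its origin metadata. A minimal sketch (the field names `source_system`, `extracted_at`, and `transformations` are illustrative, not a standard):

```python
# Sketch: attach provenance metadata to each extracted batch so that
# every downstream record can be traced back to its source and extraction time.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenancedBatch:
    records: list                  # the raw rows pulled from the source
    source_system: str             # e.g. "crm_postgres" (illustrative name)
    extracted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    transformations: list = field(default_factory=list)  # applied steps, in order

    def log_step(self, name: str) -> None:
        """Record a transformation so lineage survives the pipeline."""
        self.transformations.append(name)

batch = ProvenancedBatch(records=[{"id": 1}], source_system="crm_postgres")
batch.log_step("normalize_ids")
```

In practice this metadata would be persisted alongside the data (or in a lineage tool), not just held in memory.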
## 2.2 Acquisition Techniques
1. **Batch ETL** – Extract, Transform, Load processes that run at scheduled intervals.
2. **Real‑time Ingestion** – Event‑driven pipelines using message queues.
3. **API Pulls** – Periodic calls to external services; often used for market data.
4. **Web Scraping** – Automated extraction from public web pages; be mindful of legal and ethical constraints.
5. **Direct Uploads** – File uploads from partners or internal users.
### Best Practices
- **Idempotency**: Design ingestion jobs to safely handle duplicates.
- **Incremental Loads**: Pull only new or changed records to save bandwidth.
- **Back‑pressure Handling**: Ensure downstream systems can signal upstream when they are saturated.
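Idempotency and incremental loads often combine into one pattern: an upsert keyed on a primary key, guarded by a watermark. A minimal sketch (the `id`/`updated_at` column names and in-memory dict target are illustrative; a real pipeline would merge into a table):

```python
# Sketch of an idempotent, incremental merge: re-running the same batch
# leaves the target unchanged, and only records newer than the watermark
# are applied.

def incremental_merge(target: dict, batch: list, watermark: str) -> str:
    """Upsert batch rows into target (keyed by 'id'); return the new watermark.

    Rows at or before the watermark are skipped, so replays are safe.
    """
    new_watermark = watermark
    for row in batch:
        if row["updated_at"] <= watermark:
            continue                      # already loaded -> idempotent replay
        target[row["id"]] = row           # upsert: insert or overwrite
        new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark

target = {}
batch = [{"id": 1, "updated_at": "2026-01-02"},
         {"id": 2, "updated_at": "2026-01-03"}]
wm = incremental_merge(target, batch, "2026-01-01")
wm = incremental_merge(target, batch, wm)   # replay: no duplicates, same state
```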
## 2.3 Data Quality Dimensions
| Dimension | Definition | Typical Checks |
|-----------|------------|----------------|
| **Accuracy** | How close data is to the truth | Cross‑validation against a master source |
| **Completeness** | Required fields are present and populated | Null‑count reports |
| **Consistency** | Uniformity across systems | Schema reconciliation, foreign‑key integrity |
| **Timeliness** | Freshness of data | Lag metrics, timestamp verification |
| **Validity** | Adherence to business rules | Range checks, regex patterns |
| **Uniqueness** | No duplicate records | Deduplication logic |
### Data Quality Checklist
```python
# Sample Python snippet for a quality-check pipeline:
# per-column missing, invalid, and duplicate percentages.
import pandas as pd

def quality_report(df: pd.DataFrame, is_valid=lambda x: True) -> pd.DataFrame:
    """Build a per-column data quality report.

    `is_valid` is a caller-supplied predicate for the Validity dimension;
    the default accepts every value.
    """
    rows = []
    for col in df.columns:
        rows.append({
            'Column': col,
            'Missing %': df[col].isna().mean() * 100,
            'Invalid %': df[col].apply(lambda x: not is_valid(x)).mean() * 100,
            'Duplicates %': df[col].duplicated().mean() * 100,
        })
    return pd.DataFrame(rows)
```
## 2.4 Data Cleaning & Validation
| Task | Tools | Typical Use |
|------|-------|-------------|
| **Missing‑Value Imputation** | pandas, sklearn.impute | Mean, median, KNN, MICE |
| **Outlier Detection** | IsolationForest, DBSCAN | Remove or flag anomalies |
| **Type Normalization** | type conversions, schema enforcement | Ensure columns have correct datatypes |
| **Deduplication** | hash joins, window functions | Remove repeated rows |
| **Standardization** | regex, mapping tables | Normalize phone numbers, addresses |
| **Validation Rules** | dbt tests, Great Expectations | Business logic enforcement |
### Example: Address Standardization
```python
import re

def standardize_address(addr: str) -> str:
    # Basic example: collapse internal whitespace, trim, uppercase
    return re.sub(r'\s+', ' ', addr).strip().upper()
```
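For outlier detection, the table above lists IsolationForest and DBSCAN for production use; the underlying idea can be shown with the simpler IQR rule using only the standard library (the `k = 1.5` multiplier is the conventional choice, not a requirement):

```python
# A minimal outlier flag using the interquartile-range (IQR) rule:
# values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged as anomalies.
import statistics

def iqr_outliers(values, k: float = 1.5):
    """Return a per-value boolean list; True marks a suspected outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v < lo or v > hi for v in values]

flags = iqr_outliers([10, 11, 12, 11, 10, 95])   # 95 is the anomaly
```

Whether to remove or merely flag the outliers is a business decision; flagging preserves the evidence for later review.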
## 2.5 Establishing a Data Governance Framework
| Governance Pillar | Responsibility | Tools/Processes |
|-------------------|----------------|-----------------|
| **Data Catalog** | Data Stewards | Amundsen, DataHub |
| **Data Lineage** | Data Engineers | Apache Atlas, OpenLineage |
| **Security & Privacy** | Security Officers | RBAC, GDPR compliance checks |
| **Data Quality** | Data Quality Manager | Great Expectations, Datafold |
| **Metadata Management** | Metadata Engineer | Delta Lake, Iceberg |
| **Policy Enforcement** | Compliance Officer | Data masking, encryption policies |
### Governance Cadence
1. **Policy Definition** – Write clear rules for data collection, access, and retention.
2. **Implementation** – Embed policies into data pipelines (e.g., schema enforcement in Spark).
3. **Monitoring** – Run automated quality and lineage checks on a nightly basis.
4. **Reporting** – Publish dashboards (e.g., Power BI, Looker) that show data quality scores.
5. **Review & Iterate** – Quarterly audit of policies against business changes.
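Step 2 of the cadence, embedding policy into the pipeline itself, can be sketched as a schema gate that rejects non-conforming rows before they reach storage (the `SCHEMA` dict and field names below are illustrative; in production this role is played by Spark schema enforcement, dbt tests, or Great Expectations suites):

```python
# Sketch: a declared schema enforced inside the pipeline, so policy
# violations are caught at ingestion time rather than in reports.
SCHEMA = {"customer_id": int, "email": str, "opt_in": bool}

def enforce_schema(row: dict) -> bool:
    """Return True only if the row has every required field with the right type."""
    return all(
        key in row and isinstance(row[key], expected)
        for key, expected in SCHEMA.items()
    )

ok  = enforce_schema({"customer_id": 7, "email": "a@b.com", "opt_in": True})
bad = enforce_schema({"customer_id": "7", "email": "a@b.com"})  # wrong type, missing field
```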
## 2.6 Case Study: Building a Customer 360 View
> **Scenario**: A retail chain wants a unified view of each customer across online, in‑store, and mobile channels.
>
> **Challenges**:
> * Data resides in multiple siloed systems (CRM, POS, mobile app logs).
> * Inconsistent customer identifiers.
> * Varying data quality (missing emails, phone number formats).
>
> **Solution Steps**:
> 1. **Data Acquisition**: Pull data nightly via ETL and real‑time Kafka streams.
> 2. **Schema Harmonization**: Map fields to a common ontology.
> 3. **Identity Resolution**: Use deterministic (email) and probabilistic (fuzzy matching) techniques.
> 4. **Data Cleaning**: Impute missing demographics, deduplicate records.
> 5. **Governance**: Implement a data catalog, assign stewards, enforce GDPR consent flags.
> 6. **Validation**: Run Great Expectations tests to confirm field ranges and uniqueness.
> 7. **Result**: A 1‑to‑1 mapping for 98% of customers, enabling personalized marketing and accurate churn prediction.
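The identity-resolution step above (deterministic first, probabilistic as fallback) can be sketched with the standard library's `difflib`; the 0.85 similarity threshold is an illustrative choice that would be tuned against labeled match pairs:

```python
# Sketch of identity resolution: exact email match is the deterministic rule;
# fuzzy name similarity is the probabilistic fallback.
from difflib import SequenceMatcher

def same_customer(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Deterministic email match, falling back to fuzzy name matching."""
    if a.get("email") and a.get("email") == b.get("email"):
        return True                                   # deterministic rule
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold                         # probabilistic rule

m1 = same_customer({"email": "jo@x.com", "name": "Jo Smith"},
                   {"email": "jo@x.com", "name": "J. Smith"})
m2 = same_customer({"email": None, "name": "Jon Smith"},
                   {"email": "", "name": "John Smith"})
```

Production systems typically score several fields (name, address, phone) and combine them, but the two-tier structure is the same.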
---
## 2.7 Practical Checklist for Your Next Project
| Step | What to Do | Why It Matters |
|------|------------|----------------|
| **Map Data Sources** | Identify all upstream systems and their schemas. | Prevent integration surprises. |
| **Define Data Quality KPIs** | E.g., 95% completeness, <1% duplicate rate. | Quantifies expectations. |
| **Set Up Validation Rules** | Write tests in Great Expectations or dbt. | Automates early error detection. |
| **Implement Governance** | Assign data stewards, create a catalog. | Ensures accountability. |
| **Document Lineage** | Capture extraction times, transformation logic. | Critical for debugging and compliance. |
| **Monitor & Alert** | Nightly dashboards + automated alerts on KPI deviations. | Enables rapid response. |
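The "Monitor & Alert" step can turn the example KPIs above (95% completeness, <1% duplicate rate) into an automated gate; a minimal sketch, where the metric names are illustrative:

```python
# Sketch: checking a batch's quality metrics against the KPI thresholds
# defined in the checklist; an empty result means the batch passes.

def kpi_gate(metrics: dict) -> list:
    """Return the list of violated KPIs for alerting."""
    violations = []
    if metrics["completeness_pct"] < 95.0:
        violations.append("completeness below 95%")
    if metrics["duplicate_pct"] >= 1.0:
        violations.append("duplicate rate at or above 1%")
    return violations

alerts = kpi_gate({"completeness_pct": 97.2, "duplicate_pct": 2.4})
```

Wiring `alerts` into a pager or dashboard closes the loop between KPI definition and rapid response.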
> **Action Point**: For your upcoming project, create a *Data Quality Playbook* that lists all data sources, the quality dimensions you’ll monitor, the tools you’ll use, and the roles responsible for each stage.
---
## 2.8 Further Reading & Resources
- *Designing Data-Intensive Applications* – Martin Kleppmann
- *The Data Warehouse Toolkit* – Ralph Kimball & Margy Ross
- *Data Quality: The Accuracy Dimension* – Jack E. Olson
- *Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program* – John Ladley
---
> **Key Insight**: A robust foundation is not a one‑time setup; it’s a continuous cycle of acquisition, cleaning, validation, and governance that evolves with your organization’s data ecosystem.