
Data-Driven Strategy: Turning Numbers into Competitive Advantage - Chapter 4

Chapter 4: Exploratory Data Analysis & Feature Engineering

Published 2026-03-01 17:31

# Chapter 4: Exploratory Data Analysis & Feature Engineering

In the previous chapters we laid the foundation: we mapped our data sources, built a compliant and auditable pipeline, and ensured that every byte of information traveled safely through the data lake. Now we shift our focus to the heart of value creation: turning raw, messy data into *features*—the actionable signals that drive predictive models.

---

## 1. The EDA Mind‑set

Exploratory Data Analysis (EDA) is not a checkbox; it is an *intuitive dialogue* between the analyst and the data. Think of it as a detective who first reviews the crime scene, then gradually forms hypotheses. As we proceed, keep these guiding principles in mind:

| Principle | Why it Matters |
|---|---|
| **Open‑Ended Inquiry** | Openness drives creativity—question anything, even the obvious. |
| **Systematic Documentation** | Conscientiousness ensures we capture every insight for reproducibility. |
| **Balanced Skepticism** | Extraversion keeps the discussion lively, but we guard against confirmation bias. |
| **Iterative Refinement** | Agreeableness means we value stakeholder input; we refine hypotheses accordingly. |
| **Emotional Resilience** | Low neuroticism keeps us calm when outliers throw us off course. |

### 1.1. Tools & Environment

* **Python** (pandas, seaborn, matplotlib, plotly)
* **SQL** (for quick exploratory queries on our lakehouse)
* **JupyterLab** (notebooks for reproducible EDA)
* **R** (optional, for advanced statistical tests)

---

## 2. The Case Study: Customer‑Lifetime‑Value (CLV) Source

For this chapter, we will walk through an end‑to‑end pipeline blueprint that transforms a raw customer‑transaction source into a curated dataset ready for feature engineering. This source is the *Retail Transaction Log*—a nightly ingestion of all point‑of‑sale events.
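Before we dissect the pipeline itself, it is worth seeing what the first EDA pass looks like in practice. The sketch below is a minimal, hypothetical example using pandas (one of the tools listed in §1.1) on a toy extract with the Retail Transaction Log's columns; the data values and thresholds are illustrative, not real production figures.

```python
import pandas as pd

# Hypothetical extract shaped like the Retail Transaction Log (see §2.1)
df = pd.DataFrame({
    "transaction_id": ["TXN1", "TXN2", "TXN2", "TXN3"],
    "customer_id": ["CUST1", "CUST2", "CUST2", None],
    "quantity": [3, 1, 1, -2],
    "price": [19.99, 5.50, 5.50, 19.99],
})

# First-pass profiling: counts, uniques, and summary statistics per column
print(df.describe(include="all"))

# Null ratio per column -- candidates for imputation or exclusion
null_ratio = df.isna().mean()

# Duplicate ratio on the business key
dup_ratio = df.duplicated(subset="transaction_id").mean()

# Obvious anomalies: non-positive quantities or prices
anomalies = df[(df["quantity"] <= 0) | (df["price"] <= 0)]
```

Even this tiny pass already surfaces the three issues the pipeline below must handle systematically: duplicates, nulls, and anomalous values.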
### 2.1. Source Profile

| Field | Type | Example | Notes |
|---|---|---|---|
| transaction_id | string | TXN123456 | Unique key |
| customer_id | string | CUST98765 | Foreign key to Customer Master |
| transaction_timestamp | datetime | 2024‑02‑14 18:23:45 | UTC |
| store_id | string | ST001 | Store reference |
| product_id | string | PRD56789 | SKU |
| quantity | integer | 3 | Number of units |
| price | decimal | 19.99 | Unit price |
| payment_method | string | credit_card | Payment type |
| transaction_type | string | sale | sale / return |

### 2.2. Ingestion Mode

* **Batch** – a nightly Spark job pulls the latest change‑data capture (CDC) from the transactional database.
* **Schema‑On‑Read** – raw records are stored in S3 (as Parquet) with minimal transformation to preserve fidelity.

```python
# Example Glue job snippet (PySpark)
(spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://prod-db:5432/retail")
    .option("dbtable", "public.transaction_log")
    .option("user", "data_eng")
    .option("password", "***")
    .load()
    .write.format("parquet")
    .mode("append")
    .save("s3://retail-datalake/raw/transaction_log/"))
```

### 2.3. Storage Choice

| Layer | Storage | Rationale |
|---|---|---|
| Raw | **Amazon S3** (Parquet) | Immutable, cost‑effective, schema‑on‑read |
| Bronze | **Delta Lake** | Versioned, ACID transactions for first‑pass cleaning |
| Silver | **Snowflake** | Structured warehousing, easy SQL access |
| Gold | **Redshift Spectrum** | Fast joins across Silver and other fact tables |

The **Bronze** layer is where we perform *lightweight* cleaning: deduplication, timestamp conversion, and basic null handling. The **Silver** layer is our *feature‑ready* table, enriched with business keys and ready for analytics.

### 2.4. Transformation Steps

1. **Deduplication** – Keep the latest record per `transaction_id`.
2. **Anomaly Flagging** – Identify negative quantities or prices.
3. **Time‑Zone Normalization** – Convert to UTC.
4. **Return Reconciliation** – Subtract returned items from sales.
5. **Aggregate Summaries** – Compute per‑customer, per‑store totals.
6. **Derived Fields** – Compute `total_amount`, `is_fraud_candidate`.
7. **Data Quality Check** – Ensure `quantity > 0` for sales.

```python
import pyspark.sql.functions as F

bronze_df = spark.read.parquet("s3://retail-datalake/raw/transaction_log/")

# Deduplication: keep one record per transaction_id
bronze_dedup = bronze_df.dropDuplicates(['transaction_id'])

# Anomaly flagging: non-positive quantities or prices
bronze_clean = bronze_dedup.withColumn(
    'anomaly',
    F.when((F.col('quantity') <= 0) | (F.col('price') <= 0), True).otherwise(False)
)

# Time-zone normalization: source timestamps are already UTC (see §2.1),
# so this step simply enforces a proper timestamp type
bronze_clean = bronze_clean.withColumn(
    'transaction_ts',
    F.to_timestamp(F.col('transaction_timestamp'))
)

# Return reconciliation: net sales per customer and store
silver_df = (
    bronze_clean.groupBy('customer_id', 'store_id')
    .agg(
        F.sum(F.when(F.col('transaction_type') == 'sale',
                     F.col('quantity') * F.col('price'))).alias('sales_amount'),
        F.sum(F.when(F.col('transaction_type') == 'return',
                     F.col('quantity') * F.col('price'))).alias('return_amount')
    )
    # coalesce guards against nulls for groups with no sales or no returns
    .withColumn('net_amount',
                F.coalesce(F.col('sales_amount'), F.lit(0))
                - F.coalesce(F.col('return_amount'), F.lit(0)))
)
```

### 2.5. Monitoring Metrics

| Metric | Definition | Threshold | Alert Mechanism |
|---|---|---|---|
| Ingested Record Count | Total rows processed nightly | Expected range 5M–10M | Slack channel when out of bounds |
| Duplicate Ratio | `duplicate_rows / total_rows` | < 0.01 | Email to DataOps |
| Anomaly Percentage | `anomaly_rows / total_rows` | < 0.05 | PagerDuty alert |
| Null Ratio per Column | Count of nulls / total rows | < 0.001 | Tableau dashboard |
| Ingestion Latency | Time from job start to completion | < 2 h | CloudWatch metrics |

**Dashboard** – A lightweight Tableau or Power BI sheet pulls these metrics via a nightly SQL job, giving executives daily visibility into pipeline health.

---

## 3. Feature Engineering Principles

Now that the *silver* table is ready, we focus on crafting the **features** that will feed into predictive models. Here are our guiding tenets:
1. **Business Relevance** – Features must have a clear link to the target (e.g., CLV).
2. **Temporal Granularity** – Choose the right window (daily, weekly, monthly) for each metric.
3. **Simplicity vs. Power** – Use complex engineered features (e.g., polynomial terms) only if they significantly improve performance.
4. **Avoid Leakage** – Never use information from the future relative to the prediction time.
5. **Scalability** – Features should be computable in batch or streaming without excessive compute.

### 3.1. Example Features for CLV

| Feature | Calculation | Why It Helps |
|---|---|---|
| `recency` | Days since last purchase | Indicates customer engagement |
| `frequency` | Count of purchases in last 6 months | Captures loyalty |
| `monetary` | Total spend in last 6 months | Direct revenue proxy |
| `average_order_value` | `monetary / frequency` | Higher AOV signals high‑ticket customers |
| `return_rate` | `returns / purchases` | High return rates predict churn |
| `payment_method_drift` | Count of payment types used | Frequent changes may signal risk |
| `store_proximity` | Distance to nearest store | Geo‑centric loyalty |

We use **Spark UDFs** or **Delta Live Tables** to compute these features nightly, storing the result in a `customer_features` table in Snowflake. This table becomes the *gold* layer for downstream ML.

### 3.2. Feature Store Integration

- **Catalog**: Glue Data Catalog with metadata (description, data type, lineage).
- **Versioning**: Delta Lake table versioning to capture feature evolution.
- **Access Control**: Fine‑grained IAM policies ensuring only analytics teams can read.
- **Metrics**: Track feature stability via weekly mean/standard‑deviation drift checks.

---

## 4. Closing Reflections

Exploratory Data Analysis is the *bridge* between data ingestion and model training. By systematically probing our data, we surface insights that guide feature selection, avoid pitfalls such as leakage, and ultimately empower the model to deliver measurable ROI.
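To make the leakage point concrete, here is a minimal pandas sketch of the recency/frequency/monetary features from §3.1. The transaction frame is a hypothetical toy slice of the silver table; the key idea is the point-in-time `cutoff`: only data strictly before the prediction time may feed the features.

```python
import pandas as pd

# Hypothetical slice of the silver transaction table
tx = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C1"],
    "transaction_ts": pd.to_datetime(
        ["2024-01-05", "2024-02-01", "2024-02-10", "2024-03-20"]),
    "net_amount": [40.0, 25.0, 10.0, 99.0],
})

# Point-in-time cutoff: nothing at or after this date may leak into the features
cutoff = pd.Timestamp("2024-03-01")
window_start = cutoff - pd.DateOffset(months=6)
in_window = tx[(tx["transaction_ts"] >= window_start)
               & (tx["transaction_ts"] < cutoff)]

# RFM aggregation per customer over the 6-month window
features = (
    in_window.groupby("customer_id")
    .agg(last_purchase=("transaction_ts", "max"),
         frequency=("transaction_ts", "count"),
         monetary=("net_amount", "sum"))
)
features["recency"] = (cutoff - features["last_purchase"]).dt.days
features["average_order_value"] = features["monetary"] / features["frequency"]
```

Note that C1's 2024-03-20 purchase is excluded: it lies after the cutoff, and including it would be exactly the kind of leakage tenet 4 forbids.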
The blueprint we presented for the Retail Transaction Log is not just a technical diagram—it is a *living strategy*. As the organization grows, new sources (e.g., customer support tickets, website clickstreams) will plug into the same lifecycle: ingest → clean → enrich → monitor. The same principles of openness, conscientiousness, balanced skepticism, and resilience will carry us forward.

> **Next Chapter Preview**: In Chapter 5 we dive into *Model Development & Validation*, where we transform these curated features into robust, production‑grade predictive models.