
Data Science for the Analytical Mind: From Raw Data to Insightful Decisions – Chapter 3


Published 2026-03-03 14:46

# Chapter 3: Data Cleaning & Feature Engineering

> *"The data science pipeline is only as strong as its weakest link."* – 墨羽行

Data cleaning is the unsung hero of every successful data science project. A model that looks impressive on paper can fail spectacularly when fed real-world data that is noisy, incomplete, or incorrectly formatted. This chapter walks you through the most common pitfalls in raw data and shows how to transform messy acquisitions into a clean, engineered dataset that fuels high-performance models.

---

## 3.1 The Cleaning Process in Context

| Phase | What It Involves | Typical Tools |
|-------|------------------|---------------|
| **Ingestion** | Load raw data, verify schema, capture provenance | Pandas, SQL, Kafka, S3 |
| **Validation** | Check types, ranges, and consistency | Data-quality frameworks such as Great Expectations |
| **Transformation** | Standardize formats, normalize values | Pandas, NumPy, scikit-learn |
| **Enrichment** | Add derived columns, external lookups | Featuretools, custom pipelines |
| **Export** | Persist clean data, version control | Delta Lake, Parquet, Git |

> **Best practice:** Treat the entire cleaning process as *code*. Store transformation scripts in a Git repository, run them in CI pipelines, and document every decision.

---

## 3.2 Handling Missing Data

Missing values are a universal problem. The key is to understand *why* values are missing and choose a strategy that aligns with your business context.
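Before committing to any strategy, it helps to quantify how much is actually missing per column. A minimal sketch (the toy DataFrame below is illustrative, not from the chapter's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame; in practice, load your own data
df = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'income': [50000, 62000, np.nan, np.nan],
    'region': ['N', 'S', 'S', None],
})

# Fraction of missing values per column, worst first
missing_rates = df.isnull().mean().sort_values(ascending=False)
print(missing_rates)
```

Columns with a high missing rate often share a common cause (a broken upstream feed, an optional form field), which is exactly the signal you need before choosing a mechanism below.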
### 3.2.1 Missing Data Mechanisms

| Mechanism | Definition | Example |
|-----------|------------|---------|
| MCAR (Missing Completely At Random) | Missingness unrelated to any observed or unobserved data | Survey answers skipped because of a random system glitch |
| MAR (Missing At Random) | Missingness related to observed data | Older customers are less likely to report income |
| MNAR (Missing Not At Random) | Missingness related to the missing value itself | High-income individuals decline to disclose earnings |

### 3.2.2 Imputation Strategies

| Strategy | When to Use | Typical Code |
|----------|-------------|--------------|
| **Delete rows/columns** | < 5 % missing, irrecoverable | `df.dropna(axis=0)` |
| **Mean/Median/Mode** | Numerical features, MCAR | `df['col'] = df['col'].fillna(df['col'].median())` |
| **Forward/Backward fill** | Time-series data | `df['col'] = df['col'].ffill()` |
| **KNN Imputation** | Small dataset, correlated features | `KNNImputer().fit_transform(df)` |
| **Regression Imputation** | Predictive context | `IterativeImputer(estimator=LinearRegression())` |
| **Multiple Imputation** | Statistical rigor | `IterativeImputer(sample_posterior=True)`, run several times |
| **Domain-specific rules** | Business logic | Custom lambda functions |

#### Practical Insight

Always *explore* the pattern of missingness before deciding. Visual tools such as heatmaps or bar plots of missing rates help spot systematic issues.

```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
plt.title('Missing Value Heatmap')
plt.show()
```

---

## 3.3 Detecting and Managing Outliers

Outliers can be data entry errors, extreme events, or real but rare phenomena.
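Before reviewing the detection techniques, here is a minimal sketch of the IQR approach on toy data (the series and its values are illustrative), flagging and then capping rather than deleting:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 300])  # 300 is an obvious outlier

# Tukey's fences: 1.5 x IQR beyond the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag suspicious points, then cap to preserve row count
is_outlier = (s < lower) | (s > upper)
capped = s.clip(lower, upper)
print(is_outlier.sum(), capped.max())
```

Capping keeps the row (and any other columns it carries) in the dataset while limiting the leverage a single extreme value exerts on downstream models.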
### 3.3.1 Common Detection Techniques

| Method | How It Works | Typical Code |
|--------|--------------|--------------|
| **Statistical** (IQR, Z-score) | Uses distributional assumptions | `np.abs(z_scores) > 3` |
| **Isolation Forest** | Tree-based anomaly detection | `IsolationForest().fit_predict(df[['col']])` |
| **Visual** (boxplot, scatter) | Human inspection | `sns.boxplot(x='col', data=df)` |

### 3.3.2 Treatment Options

| Option | When to Apply | Typical Code |
|--------|---------------|--------------|
| **Capping** | Preserve extreme values but limit their impact | `df['col'] = df['col'].clip(lower, upper)` |
| **Winsorizing** | Replace extremes with percentile values | `scipy.stats.mstats.winsorize(df['col'], limits=[0.05, 0.05])` |
| **Transformation** | Reduce skewness | `np.log1p(df['col'])` |
| **Removal** | Clear errors | `df = df[df['col'] < threshold]` |

> **Rule of thumb:** If a value lies more than 1.5 × IQR beyond the quartiles (3 × IQR for extreme cases), investigate before removing it.

---

## 3.4 Noise Reduction and Signal Preservation

Noise refers to random fluctuations that obscure underlying patterns.

### 3.4.1 Smoothing Techniques

| Technique | Application | Code Snippet |
|-----------|-------------|--------------|
| **Moving Average** | Time-series smoothing | `df['col'].rolling(window=3).mean()` |
| **Gaussian Kernel** | Spatial or image data | `gaussian_filter(img, sigma=1)` |
| **Median Filter** | Salt-and-pepper noise in images | `cv2.medianBlur(img, 5)` |
| **Principal Component Analysis (PCA)** | Dimensionality reduction to filter noise | `PCA(n_components=5).fit_transform(df)` |

### 3.4.2 Signal-to-Noise Ratio (SNR)

Maintain a high SNR by *removing irrelevant variables* and *scaling* features. Use correlation matrices to prune redundant columns.
```python
import numpy as np

corr_matrix = df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
df.drop(columns=to_drop, inplace=True)
```

---

## 3.5 Feature Engineering: Turning Raw Data into Predictive Power

Feature engineering is where domain knowledge meets statistical insight. Below are the most impactful transformations.

### 3.5.1 Scaling and Normalization

| Technique | Purpose | Typical Code |
|-----------|---------|--------------|
| **StandardScaler** | Zero mean, unit variance | `StandardScaler().fit_transform(df[['col']])` |
| **MinMaxScaler** | Map to [0, 1] | `MinMaxScaler().fit_transform(df[['col']])` |
| **PowerTransformer** | Stabilize variance, reduce skew | `PowerTransformer().fit_transform(df[['col']])` |

### 3.5.2 Encoding Categorical Variables

| Encoding | Use-case | Example |
|----------|----------|---------|
| **LabelEncoder** | Ordinal categories | `LabelEncoder().fit_transform(df['cat'])` |
| **OneHotEncoder** | Nominal categories | `OneHotEncoder(sparse_output=False).fit_transform(df[['cat']])` |
| **Target Encoding** | Categories with predictive power (fit on training folds only to avoid leakage) | `df['cat_enc'] = df.groupby('cat')['target'].transform('mean')` |
| **Frequency Encoding** | Handles rare categories gracefully | `df['cat_freq'] = df['cat'].map(df['cat'].value_counts() / len(df))` |

### 3.5.3 Interaction Features

Capturing synergy between variables often boosts model performance.
```python
# Example: interaction between price and discount
df['price_discount'] = df['price'] * df['discount_rate']
```

### 3.5.4 Temporal Features

| Feature | Description | Code |
|---------|-------------|------|
| **Hour of Day** | Time-of-day effect | `df['hour'] = df['timestamp'].dt.hour` |
| **Day of Week** | Weekday vs weekend | `df['dow'] = df['timestamp'].dt.dayofweek` |
| **Rolling Aggregates** | Trend over past N periods | `df['sales_7d_avg'] = df['sales'].rolling(window=7).mean()` |

### 3.5.5 Spatial Features

For geospatial data: distance to nearest points of interest, clustering by region, or using latitude/longitude directly.

```python
from sklearn.neighbors import BallTree
import numpy as np

coords = np.radians(df[['lat', 'lon']].values)
bt = BallTree(coords, metric='haversine')

# Find distance to the nearest point
dist, ind = bt.query(coords, k=2)  # k=2 includes self
df['dist_to_nearest'] = dist[:, 1] * 6371  # km
```

### 3.5.6 Text Features

| Technique | Use-case | Tool |
|-----------|----------|------|
| **TF-IDF** | Word importance | `TfidfVectorizer` |
| **Word2Vec / Doc2Vec** | Dense embeddings | `gensim.models.Word2Vec` |
| **Sentiment Scores** | Customer feedback | `TextBlob` or `VADER` |

### 3.5.7 Domain-Specific Derived Features

| Domain | Typical Derived Feature | Rationale |
|--------|------------------------|-----------|
| Finance | Volatility index | Captures risk |
| Healthcare | BMI (weight/height²) | Correlates with health outcomes |
| Retail | Basket diversity (entropy) | Measures customer preference breadth |

---

## 3.6 Building a Reproducible Feature Pipeline

A clean pipeline ensures that transformations are applied consistently across training, validation, and production.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_features = ['age', 'income', 'score']
categorical_features = ['gender', 'region']

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Replace SomeEstimator with your model of choice
pipeline = Pipeline(steps=[('preprocess', preprocessor), ('model', SomeEstimator())])
```

> **Tip:** Use `FeatureUnion` to combine multiple pipelines (e.g., raw features + engineered features) for advanced ensembles.

---

## 3.7 Common Pitfalls & How to Avoid Them

| Pitfall | Why It Happens | Mitigation |
|---------|----------------|------------|
| **Data Leakage** | Features derived from future or test data | Split train/test before feature engineering; fit transformers on the training set only |
| **Over-Engineering** | Too many features lead to overfitting | Apply feature selection (e.g., Recursive Feature Elimination) |
| **Ignoring Missingness** | Imputation can introduce bias | Treat missingness as a feature or use models that handle it natively |
| **Hard-Coding Rules** | Non-portable code | Store rules in configuration files or use a rule engine |
| **Version Drift** | Data schema changes over time | Use a schema registry and automated migration scripts |

---

## 3.8 Case Study: Clean-Up & Feature Engineering for an E-Commerce Sales Forecast

| Step | Action | Outcome |
|------|--------|---------|
| **1** | Import raw sales logs (timestamp, SKU, price, quantity, promo flag) | 1 M rows, 12 columns |
| **2** | Impute missing `price` with the median per SKU | 0.8 % missing eliminated |
| **3** | Remove outliers in `quantity` (> 3 IQR per SKU) | Removed 0.5 % anomalous orders |
| **4** | Create `total_revenue = price * quantity` | New target variable |
| **5** | Encode `promo_flag` using target encoding | Captured promo impact |
| **6** | Generate lag features (`sales_lag_1`, `sales_lag_7`) | Captured seasonality |
| **7** | Build `sales_date` calendar features (dow, month, holiday flag) | Improved temporal patterns |
| **8** | Train a Gradient Boosting Regressor | MAE improved from 12.3 to 9.8 |
| **9** | Deploy the pipeline via MLflow | Automated retraining on new data |

> **Key takeaway:** Systematic cleaning + thoughtful feature engineering reduced error by ~20 % and increased forecast confidence.

---

## 3.9 Take-aways

1. **Understand the root cause** of missingness, outliers, and noise before applying generic fixes.
2. **Treat cleaning as code**: version, test, and document every transformation.
3. **Engineer features that reflect domain knowledge**; simple transformations (scaling, encoding) often provide the biggest gains.
4. **Monitor feature statistics** in production to detect drift early.
5. **Balance complexity and interpretability**: an overly engineered feature set can hurt model robustness.

---

> *"Data cleaning is not a one-off task; it's a mindset that keeps the data pipeline resilient and the insights trustworthy."* – 墨羽行
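As a closing sketch of take-away 4 (monitoring feature statistics in production): a minimal drift check that compares the live feature mean against the training mean. The synthetic data, seed, and 3-standard-error threshold are illustrative assumptions; real deployments typically use PSI or dedicated monitoring tools.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training-time distribution
live = rng.normal(loc=0.5, scale=1.0, size=10_000)   # shifted live data

# Flag drift when the live mean moves more than 3 standard errors
# away from the training mean (an illustrative threshold)
se = train.std(ddof=1) / np.sqrt(live.size)
drift = abs(live.mean() - train.mean()) > 3 * se
print(drift)  # True
```

In practice you would run a check like this per feature on a schedule and alert (or trigger retraining) when it fires.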