Data Science Unveiled: From Raw Data to Insightful Decisions - Chapter 5
Published 2026-03-06 21:19
# Chapter 5: Feature Engineering & Dimensionality Reduction
> *“Feature engineering is the art of turning raw data into signals that machines can learn from.”* – Andrew Ng
## 5.1 Why Feature Engineering Matters
| Perspective | What It Enables | Example |
|-------------|-----------------|---------|
| **Model performance** | Reduces noise, amplifies signal | Adding *Age × Education* improves linear regression on income prediction |
| **Interpretability** | Clear, domain‑driven attributes | Encoding *Marital Status* as *Single/Married/Other* aids business storytelling |
| **Computational efficiency** | Fewer, more informative features | PCA can reduce a 100‑dim dataset to 10 components while retaining most of the variance |
Feature engineering is the bridge between raw data and the machine learning model. Without thoughtful feature creation, even the most sophisticated algorithms can underperform or produce misleading results.
## 5.2 Core Concepts and Terminology
| Term | Definition |
|------|------------|
| **Feature** | Individual measurable property of an instance |
| **Feature Space** | Multidimensional space defined by all features |
| **Dimensionality** | Number of features |
| **High‑dimensionality** | Feature space with many dimensions, often leading to the *curse of dimensionality* |
| **Feature Selection** | Choosing a subset of relevant features |
| **Feature Extraction** | Transforming raw data into a compact representation |
| **Encoding** | Converting categorical data into numeric form |
| **Imputation** | Replacing missing values |
| **Scaling** | Adjusting ranges (e.g., Min‑Max, Standardization) |
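The two scaling schemes named in the table can be contrasted in a few lines with scikit-learn (a minimal sketch; the toy array is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

minmax = MinMaxScaler().fit_transform(X)      # rescales values into [0, 1]
standard = StandardScaler().fit_transform(X)  # zero mean, unit variance

print(minmax.ravel())   # smallest value -> 0.0, largest -> 1.0
print(standard.mean())  # ~0.0
```

Min-Max preserves the shape of the original distribution inside a fixed range, while standardization centers and rescales it, which is usually preferred when outliers are moderate and downstream models assume roughly comparable feature variances.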
## 5.3 The Feature Engineering Workflow
1. **Understand the Data**
* Domain knowledge, business objectives, and data provenance guide decisions.
2. **Pre‑process**
* Clean missing values, outliers, and inconsistent formats.
3. **Encode Categorical Variables**
* One‑hot, target, ordinal, or embedding encodings.
4. **Create Derived Features**
* Interaction terms, polynomial features, aggregates, or domain‑specific transformations.
5. **Reduce Dimensionality**
* Feature selection (filter, wrapper, embedded) or extraction (PCA, t‑SNE, autoencoders).
6. **Validate**
* Cross‑validate models with and without new features to confirm benefit.
## 5.4 Encoding Strategies
| Scenario | Preferred Encoding | Python Example |
|----------|--------------------|----------------|
| **Nominal** (no order) | One‑hot | `pd.get_dummies(df, columns=['color'])` |
| **Ordinal** (ordered categories) | Ordinal | `oe = OrdinalEncoder(); df[['grade']] = oe.fit_transform(df[['grade']])` |
| **High‑cardinality** | Target or frequency | `df['country'] = df['country'].map(df.groupby('country').size())` |
| **Embeddings** (deep learning) | Learnable embedding layer | `nn.Embedding(num_embeddings, embedding_dim)` |
### One‑Hot vs. Ordinal vs. Target Encoding
- **One‑Hot**: Adds `k` binary columns for a feature with `k` categories (or `k-1` with `drop_first=True`, which avoids perfect collinearity). Avoids imposing arbitrary order.
- **Ordinal**: Uses integers `0` to `k-1`. Suitable when categories have a natural order.
- **Target**: Maps categories to the mean of the target variable. Can cause target leakage unless the encodings are computed out‑of‑fold.
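The leakage warning for target encoding deserves a concrete sketch: each row's encoding should come only from folds the row is *not* in. A minimal illustration on toy data (the `city` column and values are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'city': ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'a'],
    'y':    [1,   0,   1,   1,   0,   0,   0,   1],
})

df['city_te'] = 0.0
global_mean = df['y'].mean()  # fallback for categories unseen in a fold
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    # Category means computed on the training folds only
    fold_means = df.iloc[train_idx].groupby('city')['y'].mean()
    df.loc[df.index[val_idx], 'city_te'] = (
        df.iloc[val_idx]['city'].map(fold_means).fillna(global_mean).values
    )
```

Because no row's own target contributes to its encoding, the resulting feature can be used in cross-validation without optimistic bias.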
## 5.5 Creating Derived Features
| Technique | Use‑case | Code Snippet |
|-----------|----------|---------------|
| **Polynomial / interaction** | Capturing non‑linear or joint effects | `df['age_sq'] = df['age'] ** 2` |
| **Binning** | Discretizing continuous variables | `df['income_bin'] = pd.cut(df['income'], bins=4, labels=False)` |
| **Date‑Time Splits** | Temporal insights | `df['month'] = df['date'].dt.month` |
| **Text Length** | NLP features | `df['text_len'] = df['comment'].apply(len)` |
| **Log Transformation** | Reduce skew | `df['log_income'] = np.log1p(df['income'])` |
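The snippets in the table can be combined on a small toy frame to show several derived features built in one pass (the column names and values here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'income': [32000, 54000, 120000, 15000],
    'date': pd.to_datetime(['2024-01-15', '2024-06-01',
                            '2024-06-20', '2024-12-31']),
    'comment': ['ok', 'great product', '', 'meh'],
})

df['income_bin'] = pd.cut(df['income'], bins=4, labels=False)  # discretize
df['month'] = df['date'].dt.month                              # temporal split
df['text_len'] = df['comment'].str.len()                       # NLP length feature
df['log_income'] = np.log1p(df['income'])                      # reduce skew

print(df[['income_bin', 'month', 'text_len']].to_string())
```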
### Practical Tips
- **Avoid over‑engineering**: Keep the feature set manageable to prevent overfitting.
- **Feature importance**: Use model‑based importance (e.g., RandomForest) to prune.
- **Feature selection pipelines**: Integrate `SelectKBest` or `RFE` in scikit‑learn pipelines.
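As a sketch of the pipeline tip above, `SelectKBest` can sit in front of any estimator so that feature selection is re-fitted inside each cross-validation fold rather than on the full dataset (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# 20 features, only 5 of which carry signal
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=5)),   # keep the 5 strongest features
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Selecting inside the pipeline prevents the selector from peeking at validation folds, the same leakage concern raised in Section 5.8.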
## 5.6 Dimensionality Reduction Techniques
### 5.6.1 Principal Component Analysis (PCA)
- **Goal**: Orthogonal linear transformation to maximize variance.
- **Key steps** (NumPy sketch):

```python
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
components = eigvecs[:, ::-1]            # reorder to descending variance
X_pca = X_centered @ components[:, :n_components]
```
- **When to use**: High‑dimensional numeric data with multicollinearity.
- **Pros**: Fast, interpretable eigenvalues.
- **Cons**: Linear assumption.
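In practice one rarely hand-rolls the eigendecomposition; scikit-learn's `PCA` accepts a variance fraction directly. A minimal sketch on synthetic, deliberately collinear data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Make column 1 nearly a linear copy of column 0 (multicollinearity)
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)

pca = PCA(n_components=0.95)  # keep enough components for 95% of variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

Because one dimension is redundant, fewer than 10 components suffice; inspecting `explained_variance_ratio_` is the usual way to choose the cut-off.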
### 5.6.2 t‑Distributed Stochastic Neighbor Embedding (t‑SNE)
- **Goal**: Preserve local structure for visualization.
- **Parameters**: `perplexity`, `learning_rate`, `n_iter`.
- **When to use**: Exploratory 2‑D/3‑D visualizations of clusters.
- **Caveat**: Primarily a visualization tool; it learns no mapping for new data and distorts global distances, so the embedding should not be fed into downstream models.
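A typical t-SNE use is a quick 2-D embedding for plotting clusters; a minimal sketch on the digits dataset (subsampled here only to keep the run fast):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Embed 500 of the 64-dimensional digit images into 2-D for plotting
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X[:500])
print(emb.shape)  # (500, 2)
```

The resulting `emb` array is what one would scatter-plot, colored by `y[:500]`, to inspect cluster structure.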
### 5.6.3 Autoencoders
- **Goal**: Learn a compressed representation via neural networks.
- **Architecture**: Encoder → Bottleneck → Decoder.
- **When to use**: Non‑linear feature extraction, denoising.
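Autoencoders are usually built in a deep-learning framework such as PyTorch or Keras. As a framework-free stand-in, a scikit-learn `MLPRegressor` trained to reproduce its own input behaves like a one-hidden-layer (here linear) autoencoder; a rough sketch under that simplification, not a production recipe:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))

# Train the network to reconstruct its input; the 3-unit hidden layer
# is the bottleneck (Encoder -> Bottleneck -> Decoder).
ae = MLPRegressor(hidden_layer_sizes=(3,), activation='identity',
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# Encoder: project inputs through the learned first-layer weights.
# With an identity activation this is exactly the hidden representation.
codes = X @ ae.coefs_[0] + ae.intercepts_[0]
print(codes.shape)  # (300, 3)
```

With a linear activation this collapses to a PCA-like projection; non-linear activations and deeper stacks are what give real autoencoders their extra expressive power.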
## 5.7 Practical Workflow Example
Below is a concise example applying feature engineering to the *Adult Census Income* dataset from UCI. The goal is to predict whether a person earns >$50K.
```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load data
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df = pd.read_csv(url, names=cols, na_values='?', skipinitialspace=True)

X = df.drop('income', axis=1)
y = (df['income'] == '>50K').astype(int)

numeric = X.select_dtypes(include='number').columns
categorical = X.select_dtypes(exclude='number').columns

# 1.-2. Impute, then scale numerics and one-hot encode categoricals.
# Bundling all steps into one ColumnTransformer guarantees they are
# fitted on training data only.
num_pipeline = Pipeline([('fill', SimpleImputer(strategy='median')),
                         ('scaler', StandardScaler())])
cat_pipeline = Pipeline([('fill', SimpleImputer(strategy='most_frequent')),
                         ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocess = ColumnTransformer([('num', num_pipeline, numeric),
                                ('cat', cat_pipeline, categorical)])

# 3. Split *before* fitting any preprocessing, to prevent leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 4. Model chained after preprocessing in a single pipeline
model = Pipeline([('prep', preprocess),
                  ('clf', RandomForestClassifier(n_estimators=200,
                                                 random_state=42, n_jobs=-1))])
model.fit(X_train, y_train)
pred = model.predict_proba(X_test)[:, 1]
print('ROC-AUC:', roc_auc_score(y_test, pred))
```
**Key Takeaways**
- **Pipeline** guarantees reproducibility: preprocessing is applied consistently.
- **Imputation** before encoding ensures the encoder never receives missing values.
- **Scaling** is essential for distance‑based algorithms; Random Forest itself is scale‑agnostic, but keeping the scaler in the pipeline lets the same preprocessing serve other model families.
## 5.8 Common Pitfalls & How to Avoid Them
| Pitfall | Symptom | Remedy |
|----------|---------|--------|
| **Data leakage** | Test set statistics leak into training | Use `ColumnTransformer` inside `Pipeline` and fit only on training data |
| **Over‑encoding** | Too many dummy variables, causing sparsity | Use `drop='first'` or dimensionality reduction after encoding |
| **Ignoring categorical hierarchy** | Treating ordinal variables as nominal | Map to integer scale preserving order |
| **Outlier propagation** | Model becomes unstable | Detect via IQR or z‑score; consider robust scaling |
| **Inadequate validation** | Feature importance over‑estimated | Use nested cross‑validation for hyper‑parameter tuning |
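For the categorical-hierarchy pitfall above, the remedy is to state the category order explicitly rather than letting the encoder pick an alphabetical one; a minimal sketch with `OrdinalEncoder` (the `grade` column is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'grade': ['low', 'high', 'medium', 'low']})

# Pass the order explicitly so that 'low' < 'medium' < 'high'
enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['grade_ord'] = enc.fit_transform(df[['grade']]).ravel()
print(df['grade_ord'].tolist())  # [0.0, 2.0, 1.0, 0.0]
```

Without the explicit `categories` argument the encoder would sort alphabetically ('high' < 'low' < 'medium'), silently destroying the ordinal meaning.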
## 5.9 Ethical and Reproducibility Considerations
- **Feature privacy**: Avoid encoding sensitive attributes that can indirectly expose protected groups.
- **Bias amplification**: Some engineered features (e.g., occupation categories) may encode historical biases; audit feature importance across demographic slices.
- **Documentation**: Store all transformation steps in a versioned notebook or `dvc` pipeline.
- **Data lineage**: Track original values, imputations, and transformations using a data catalog.
## 5.10 Summary
Feature engineering and dimensionality reduction are the linchpins of a robust data science workflow. By transforming raw, messy data into clean, informative signals, we empower models to learn patterns more effectively and to generalize beyond the training set. Key practices include:
1. **Thoughtful encoding** tailored to category semantics.
2. **Derived features** that capture domain‑specific interactions.
3. **Dimensionality reduction** to mitigate the curse of dimensionality and enhance interpretability.
4. **Reproducible pipelines** that encapsulate preprocessing steps.
5. **Ethical vigilance** to ensure fairness and compliance.
In the next chapter, we will translate these engineered features into predictive power using supervised learning algorithms, carefully tuning hyper‑parameters and validating model performance.