Unveiling Insight: Data Science for Strategic Decision‑Making, Chapter 3
Published 2026-03-07 18:27
# Chapter 3: Cleaning and Preparing Data
In this chapter we bridge the gap between raw data acquisition and the analytical or modeling steps that follow. Data cleaning is often the most time‑consuming yet most critical part of any data science project. A clean dataset not only speeds up experimentation but also safeguards against biased or misleading insights.
---
## 3.1 Why Cleaning Matters
| **Aspect** | **Impact** | **Business Example** |
|------------|------------|---------------------|
| Missing values | Inaccurate statistics; model bias | A customer‑segmentation model trained on incomplete purchase histories misclassifies high‑value customers, leading to lost revenue |
| Outliers | Skewed distributions; algorithmic instability | An outlier in sales data inflates the mean, making forecasting models over‑predict demand |
| Inconsistencies | Intractable joins; erroneous logic | Inconsistent date formats prevent merging transaction logs with marketing events |
| Irrelevant features | Curse of dimensionality; longer runtimes | Including an unused user‑agent field in a churn model wastes compute and can add noise |
> **Pro Tip:** Always start with a *data quality dashboard* that flags missing rates, duplicates, and key schema violations before moving to downstream analyses.
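As a minimal sketch of such a dashboard (toy data; the helper name `quality_report` is ours), a single function can report per‑column missing rates, unique counts, dtypes, and a duplicate‑row count:

```python
import pandas as pd

def quality_report(df):
    """Per-column missing rate, unique count, and dtype, plus a duplicate-row count."""
    report = pd.DataFrame({
        'missing_rate': df.isna().mean(),
        'n_unique': df.nunique(),
        'dtype': df.dtypes.astype(str),
    })
    print(f'Duplicate rows: {df.duplicated().sum()}')
    return report

# Toy frame: one missing value in 'a', one exact duplicate row
toy = pd.DataFrame({'a': [1.0, 1.0, None], 'b': ['x', 'x', 'y']})
print(quality_report(toy))
```

Running this before any modeling gives a quick baseline you can re-check after each cleaning step.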
---
## 3.2 Handling Missing Values
Missing data can be **missing completely at random** (MCAR), **missing at random** (MAR, where missingness depends on observed values), or **missing not at random** (MNAR, where it depends on the unobserved values themselves). The handling strategy should reflect the underlying mechanism.
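One lightweight probe for MAR (column names here are illustrative): compare a field's missing rate across levels of an observed field. A large gap suggests the missingness depends on observed data and is therefore not MCAR:

```python
import pandas as pd

# Toy data: does income's missingness vary with the observed segment?
probe = pd.DataFrame({
    'segment': ['A', 'A', 'B', 'B', 'B', 'B'],
    'income':  [50.0, None, 60.0, None, None, None],
})
rates = probe.groupby('segment')['income'].apply(lambda s: s.isna().mean())
print(rates)  # sharply different rates per segment argue against MCAR
```

This is only a heuristic; distinguishing MAR from MNAR generally requires domain knowledge, since MNAR depends on values you never observe.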
### 3.2.1 Detecting Missingness
```python
import pandas as pd

df = pd.read_csv('sales.csv')

# Quick summary of missingness, highest rate first
missing_summary = df.isna().mean().sort_values(ascending=False)
print(missing_summary)
```
> **Tip:** Visualize missingness with a heatmap:
```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isna(), cbar=False)
plt.title('Missing Value Heatmap')
plt.show()
```
### 3.2.2 Strategies
| Strategy | When to Use | Typical Implementation |
|----------|-------------|------------------------|
| **Drop** | Very few rows or columns are missing; missingness is random | `df.dropna()` or `df.dropna(axis=1, thresh=int(0.8*len(df)))` |
| **Imputation – Mean/Median** | Continuous features with low missingness; MAR | `df['price'] = df['price'].fillna(df['price'].median())` |
| **Imputation – Mode** | Categorical features | `df['category'] = df['category'].fillna(df['category'].mode()[0])` |
| **Model‑based Imputation** | Structured missingness; higher missing rate | `sklearn.impute.IterativeImputer` or `KNNImputer` |
| **Create Missing Indicator** | Missingness carries information | `df['price_missing'] = df['price'].isna().astype(int)` |
#### Example: Iterative Imputer
```python
import pandas as pd
# Experimental API: this import is required to enable IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(df[['price', 'quantity', 'discount']])
X_imputed = pd.DataFrame(X_imputed, columns=['price', 'quantity', 'discount'])
```
---
## 3.3 Outlier Detection and Treatment
Outliers can arise from data entry errors, sensor glitches, or genuine extreme events. Treating them appropriately preserves model robustness.
### 3.3.1 Detection Techniques
| Technique | Strengths | Weaknesses |
|-----------|-----------|------------|
| **Z‑score** | Simple; works for normal distributions | Sensitive to non‑normal data |
| **IQR (Inter‑Quartile Range)** | Works for skewed data | Requires manual threshold tuning |
| **Isolation Forest** | Unsupervised; captures arbitrary shapes | Requires tuning (e.g., `contamination`, `n_estimators`) |
| **Local Outlier Factor (LOF)** | Considers local density | Computationally expensive for large data |
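As a sketch of the Isolation Forest route on synthetic data (the `contamination` value is an assumption you would tune for your data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 500 ordinary observations around 100, plus two gross errors
sales = np.concatenate([rng.normal(100, 10, 500), [500.0, -200.0]]).reshape(-1, 1)

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso.fit_predict(sales)  # -1 marks outliers, 1 marks inliers
print('Outliers flagged:', int((labels == -1).sum()))
```

The two injected extremes are among the flagged points; with `contamination=0.01` the model flags roughly 1% of rows, so inspect the flags before deleting anything.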
#### IQR Example
```python
Q1 = df['sales'].quantile(0.25)
Q3 = df['sales'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['sales'] < lower) | (df['sales'] > upper)]
print(f'Found {len(outliers)} outliers')
```
### 3.3.2 Treatment Options
| Option | Use‑Case | Implementation |
|--------|----------|----------------|
| **Removal** | Clear data errors | `df = df[(df['sales'] >= lower) & (df['sales'] <= upper)]` |
| **Winsorization** | Preserve data quantity; limit extremes | `from scipy.stats.mstats import winsorize` |
| **Capping** | Set hard limits | `df['sales'] = df['sales'].clip(lower=lower, upper=upper)` |
| **Flagging** | Keep raw value for later analysis | `df['sales_outlier'] = ((df['sales'] < lower) | (df['sales'] > upper)).astype(int)` |
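The winsorization option from the table can be sketched as follows (the 5% limits and the toy series are illustrative; with 20 values, 5% caps exactly one value at each end):

```python
import numpy as np
from scipy.stats.mstats import winsorize

sales = np.append(np.arange(1.0, 20.0), 500.0)  # 1..19 plus one extreme value
# Replace the lowest and highest 5% of values with the nearest remaining value
capped = np.asarray(winsorize(sales, limits=[0.05, 0.05]))
print(capped.min(), capped.max())
```

Unlike removal, winsorization keeps the row count intact, which matters when downstream joins or time indices depend on every observation being present.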
---
## 3.4 Consistency and Standardization
Inconsistent data (e.g., differing date formats, duplicate keys) hampers join operations and feature engineering.
### 3.4.1 Schema Validation
```python
schema = {
    'order_id': int,
    'customer_id': int,
    'order_date': 'datetime64[ns]',
    'amount': float,
    'currency': str,
}
for col, dtype in schema.items():
    df[col] = df[col].astype(dtype)
```
### 3.4.2 Normalization of Categorical Fields
```python
# Standardize country codes
df['country'] = df['country'].str.upper()

# Collapse similar categories
df['product_category'] = df['product_category'].replace({
    'Laptops': 'Computers',
    'Notebooks': 'Computers',
    'Cell Phones': 'Mobiles',
})
```
### 3.4.3 Duplicate Removal
```python
# Identify exact duplicates
duplicate_rows = df[df.duplicated(keep=False)]
print(f'Found {duplicate_rows.shape[0]} duplicate rows')

# Remove duplicates
df = df.drop_duplicates()
```
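Exact-duplicate removal misses rows that repeat a business key with different payloads. Assuming `order_id` should be unique (toy data below), a key-based pass can keep only the latest record per key:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 1, 2],
    'order_date': pd.to_datetime(['2026-01-01', '2026-01-05', '2026-01-02']),
    'amount': [100.0, 120.0, 80.0],
})
# Sort so 'last' means the most recent record, then dedupe on the key
deduped = (orders.sort_values('order_date')
                 .drop_duplicates(subset='order_id', keep='last'))
print(len(deduped))
```

Which record to keep (latest, first, or highest-quality) is a business decision, not a purely technical one.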
---
## 3.5 Feature Engineering Basics
Transforming raw observations into predictive signals is a core competency.
### 3.5.1 Deriving Temporal Features
```python
df['order_month'] = df['order_date'].dt.month
df['order_weekday'] = df['order_date'].dt.weekday
```
### 3.5.2 Binning Numerical Variables
```python
import numpy as np
import pandas as pd

# Income brackets
bins = [0, 30000, 60000, 90000, np.inf]
labels = ['Low', 'Medium', 'High', 'Very High']
df['income_bracket'] = pd.cut(df['annual_income'], bins=bins, labels=labels)
```
### 3.5.3 Encoding Categorical Variables
| Encoding | Suitable For | Example |
|----------|--------------|---------|
| One‑Hot | Nominal | `pd.get_dummies(df['color'])` |
| Ordinal | Ordered | `df['size'] = df['size'].map({'S':1,'M':2,'L':3})` |
| Target | High cardinality | `TargetEncoder` from the `category_encoders` library |
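A minimal one-hot sketch (the `color` column is illustrative): `pd.get_dummies` expands a nominal column into one indicator column per category, which you then concatenate back in place of the original:

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'red']})
dummies = pd.get_dummies(colors['color'], prefix='color')
encoded = pd.concat([colors.drop(columns='color'), dummies], axis=1)
print(encoded.columns.tolist())
```

For linear models, consider `drop_first=True` to avoid perfectly collinear indicator columns.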
### 3.5.4 Interaction Features
```python
# Interaction between price and quantity
df['price_quantity'] = df['price'] * df['quantity']
```
---
## 3.6 Dimensionality Reduction
High‑dimensional data can inflate variance, overfit models, and increase computational cost. Dimensionality reduction balances information retention with model efficiency.
### 3.6.1 Principal Component Analysis (PCA)
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df.select_dtypes(include=[np.number])
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # retain 95% of variance
X_pca = pca.fit_transform(X_scaled)
print(f'Reduced from {X.shape[1]} to {X_pca.shape[1]} dimensions')
```
### 3.6.2 t‑Distributed Stochastic Neighbor Embedding (t‑SNE)
> Use t‑SNE for **visualization** of high‑dimensional data only; it learns no reusable mapping for new points and scales poorly, so it is not suitable as a preprocessing step for downstream modeling.
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
```
### 3.6.3 Feature Selection Techniques
| Technique | When to Use | Example |
|-----------|-------------|---------|
| **Filter** (e.g., chi‑square) | Quick, model‑agnostic | `SelectKBest(chi2, k=10)` |
| **Wrapper** (e.g., Recursive Feature Elimination) | Model‑dependent | `RFE(estimator=RandomForestClassifier(), n_features_to_select=5)` |
| **Embedded** (e.g., Lasso) | Regularization‑based | `Lasso(alpha=0.01).fit(X, y)` |
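As a sketch of the filter approach on synthetic data (feature 0 is deliberately constructed to drive the target; note that chi-square requires non-negative features):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
X = rng.integers(0, 10, size=(100, 6)).astype(float)  # non-negative counts
y = (X[:, 0] >= 5).astype(int)  # target depends only on feature 0

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features
```

Filter methods like this are fast and model-agnostic, but they score features one at a time and can miss interactions that wrapper or embedded methods would catch.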
---
## 3.7 Pipeline Integration
Automating cleaning steps ensures reproducibility and scalability.
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
])
X_processed = pipeline.fit_transform(df.select_dtypes(include=[np.number]))
```
> **Best Practice:** Store the fitted pipeline (e.g., with `joblib`) and reuse it on production data so training and serving apply identical transformations.
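Persisting and reloading a fitted pipeline might look like this self-contained sketch (the file name and toy data are illustrative):

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Fit a small pipeline on toy training data (one missing value)
X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])
pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', StandardScaler())])
pipe.fit(X_train)

# Persist the fitted pipeline, then reload it as a scoring job would
joblib.dump(pipe, 'cleaning_pipeline.joblib')
restored = joblib.load('cleaning_pipeline.joblib')
print(restored.transform(np.array([[2.0, np.nan]])))
```

Because the restored object carries the fitted medians and scaling statistics, production data is transformed exactly as the training data was.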
---
## 3.8 Summary
* Cleaning is an iterative, context‑driven process that demands both statistical insight and domain knowledge.
* Missing values, outliers, and inconsistencies should be identified early and handled with methods that preserve data integrity.
* Feature engineering and dimensionality reduction turn raw observations into robust signals, improving model performance and interpretability.
* Automating the pipeline safeguards consistency across experiments and production deployments.
By mastering these techniques, you lay a strong foundation for reliable analytics, trustworthy models, and actionable business insights.