Unveiling Insight: Data Science for Strategic Decision‑Making, Chapter 3
Published 2026-03-07 18:27
# Chapter 3: Cleaning and Preparing Data
In this chapter we bridge the gap between raw data acquisition and the analytical or modeling steps that follow. Data cleaning is often the most time‑consuming yet most critical part of any data science project. A clean dataset not only speeds up experimentation but also safeguards against biased or misleading insights.
---
## 3.1 Why Cleaning Matters
| **Aspect** | **Impact** | **Business Example** |
|------------|------------|---------------------|
| Missing values | Inaccurate statistics; model bias | A customer‑segmentation model trained on incomplete purchase histories misclassifies high‑value customers, leading to lost revenue |
| Outliers | Skewed distributions; algorithmic instability | An outlier in sales data inflates the mean, making forecasting models over‑predict demand |
| Inconsistencies | Intractable joins; erroneous logic | Inconsistent date formats prevent merging transaction logs with marketing events |
| Irrelevant features | Curse of dimensionality; longer runtimes | Including an unused user‑agent field in a churn model wastes compute and can add noise |
> **Pro Tip:** Always start with a *data quality dashboard* that flags missing rates, duplicates, and key schema violations before moving to downstream analyses.
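As a minimal sketch of such a dashboard (toy data; the helper name `quality_report` is ours), a single function can report per‑column missing rates, unique counts, dtypes, and a duplicate‑row count:

```python
import pandas as pd

def quality_report(df):
    """Per-column missing rate, unique count, and dtype, plus a duplicate-row count."""
    report = pd.DataFrame({
        'missing_rate': df.isna().mean(),
        'n_unique': df.nunique(),
        'dtype': df.dtypes.astype(str),
    })
    print(f'Duplicate rows: {df.duplicated().sum()}')
    return report

# Toy frame: one missing value in 'a', one exact duplicate row
toy = pd.DataFrame({'a': [1.0, 1.0, None], 'b': ['x', 'x', 'y']})
print(quality_report(toy))
```

Running this before any modeling gives a quick baseline you can re-check after each cleaning step.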
---
## 3.2 Handling Missing Values
Missing data can be **missing completely at random** (MCAR), **missing at random** (MAR, where missingness depends on observed values), or **missing not at random** (MNAR, where it depends on the unobserved values themselves). The handling strategy should reflect the underlying mechanism.
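One lightweight probe for MAR (column names here are illustrative): compare a field's missing rate across levels of an observed field. A large gap suggests the missingness depends on observed data and is therefore not MCAR:

```python
import pandas as pd

# Toy data: does income's missingness vary with the observed segment?
probe = pd.DataFrame({
    'segment': ['A', 'A', 'B', 'B', 'B', 'B'],
    'income':  [50.0, None, 60.0, None, None, None],
})
rates = probe.groupby('segment')['income'].apply(lambda s: s.isna().mean())
print(rates)  # sharply different rates per segment argue against MCAR
```

This is only a heuristic; distinguishing MAR from MNAR generally requires domain knowledge, since MNAR depends on values you never observe.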
### 3.2.1 Detecting Missingness
```python
import pandas as pd

df = pd.read_csv('sales.csv')

# Quick summary of missingness, highest rate first
missing_summary = df.isna().mean().sort_values(ascending=False)
print(missing_summary)
```
> **Tip:** Visualize missingness with a heatmap:
```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isna(), cbar=False)
plt.title('Missing Value Heatmap')
plt.show()
```
### 3.2.2 Strategies
| Strategy | When to Use | Typical Implementation |
|----------|-------------|------------------------|
| **Drop** | Very few rows or columns are missing; missingness is random | `df.dropna()` or `df.dropna(axis=1, thresh=int(0.8*len(df)))` |
| **Imputation – Mean/Median** | Continuous features with low missingness; MAR | `df['price'] = df['price'].fillna(df['price'].median())` |
| **Imputation – Mode** | Categorical features | `df['category'] = df['category'].fillna(df['category'].mode()[0])` |
| **Model‑based Imputation** | Structured missingness; higher missing rate | `sklearn.impute.IterativeImputer` or `KNNImputer` |
| **Create Missing Indicator** | Missingness carries information | `df['price_missing'] = df['price'].isna().astype(int)` |
#### Example: Iterative Imputer
```python
import pandas as pd
# Experimental API: this import is required to enable IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(df[['price', 'quantity', 'discount']])
X_imputed = pd.DataFrame(X_imputed, columns=['price', 'quantity', 'discount'])
```
---
## 3.3 Outlier Detection and Treatment
Outliers can arise from data entry errors, sensor glitches, or genuine extreme events. Treating them appropriately preserves model robustness.
### 3.3.1 Detection Techniques
| Technique | Strengths | Weaknesses |
|-----------|-----------|------------|
| **Z‑score** | Simple; works for normal distributions | Sensitive to non‑normal data |
| **IQR (Inter‑Quartile Range)** | Works for skewed data | Requires manual threshold tuning |
| **Isolation Forest** | Unsupervised; captures arbitrary shapes | Requires tuning (e.g., `contamination`, `n_estimators`) |
| **Local Outlier Factor (LOF)** | Considers local density | Computationally expensive for large data |
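As a sketch of the Isolation Forest route on synthetic data (the `contamination` value is an assumption you would tune for your data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 500 ordinary observations around 100, plus two gross errors
sales = np.concatenate([rng.normal(100, 10, 500), [500.0, -200.0]]).reshape(-1, 1)

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso.fit_predict(sales)  # -1 marks outliers, 1 marks inliers
print('Outliers flagged:', int((labels == -1).sum()))
```

The two injected extremes are among the flagged points; with `contamination=0.01` the model flags roughly 1% of rows, so inspect the flags before deleting anything.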
#### IQR Example
```python
Q1 = df['sales'].quantile(0.25)
Q3 = df['sales'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['sales'] < lower) | (df['sales'] > upper)]
print(f'Found {len(outliers)} outliers')
```
### 3.3.2 Treatment Options
| Option | Use‑Case | Implementation |
|--------|----------|----------------|
| **Removal** | Clear data errors | `df = df[(df['sales'] >= lower) & (df['sales'] <= upper)]` |
| **Winsorization** | Preserve data quantity; limit extremes | `from scipy.stats.mstats import winsorize` |
| **Capping** | Set hard limits | `df['sales'] = df['sales'].clip(lower=lower, upper=upper)` |
| **Flagging** | Keep raw value for later analysis | `df['sales_outlier'] = ((df['sales'] < lower) | (df['sales'] > upper)).astype(int)` |
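The winsorization option from the table can be sketched as follows (the 5% limits and the toy series are illustrative; with 20 values, 5% caps exactly one value at each end):

```python
import numpy as np
from scipy.stats.mstats import winsorize

sales = np.append(np.arange(1.0, 20.0), 500.0)  # 1..19 plus one extreme value
# Replace the lowest and highest 5% of values with the nearest remaining value
capped = np.asarray(winsorize(sales, limits=[0.05, 0.05]))
print(capped.min(), capped.max())
```

Unlike removal, winsorization keeps the row count intact, which matters when downstream joins or time indices depend on every observation being present.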
---
## 3.4 Consistency and Standardization
Inconsistent data (e.g., differing date formats, duplicate keys) hampers join operations and feature engineering.
### 3.4.1 Schema Validation
```python
schema = {
    'order_id': int,
    'customer_id': int,
    'order_date': 'datetime64[ns]',
    'amount': float,
    'currency': str,
}
for col, dtype in schema.items():
    df[col] = df[col].astype(dtype)
```
### 3.4.2 Normalization of Categorical Fields
```python
# Standardize country codes
df['country'] = df['country'].str.upper()

# Collapse similar categories
df['product_category'] = df['product_category'].replace({
    'Laptops': 'Computers',
    'Notebooks': 'Computers',
    'Cell Phones': 'Mobiles',
})
```
### 3.4.3 Duplicate Removal
```python
# Identify exact duplicates
duplicate_rows = df[df.duplicated(keep=False)]
print(f'Found {duplicate_rows.shape[0]} duplicate rows')

# Remove duplicates
df = df.drop_duplicates()
```
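Exact-duplicate removal misses rows that repeat a business key with different payloads. Assuming `order_id` should be unique (toy data below), a key-based pass can keep only the latest record per key:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 1, 2],
    'order_date': pd.to_datetime(['2026-01-01', '2026-01-05', '2026-01-02']),
    'amount': [100.0, 120.0, 80.0],
})
# Sort so 'last' means the most recent record, then dedupe on the key
deduped = (orders.sort_values('order_date')
                 .drop_duplicates(subset='order_id', keep='last'))
print(len(deduped))
```

Which record to keep (latest, first, or highest-quality) is a business decision, not a purely technical one.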
---
## 3.5 Feature Engineering Basics
Transforming raw observations into predictive signals is a core competency.
### 3.5.1 Deriving Temporal Features
```python
df['order_month'] = df['order_date'].dt.month
df['order_weekday'] = df['order_date'].dt.weekday
```
### 3.5.2 Binning Numerical Variables
```python
import numpy as np
import pandas as pd

# Income brackets
bins = [0, 30000, 60000, 90000, np.inf]
labels = ['Low', 'Medium', 'High', 'Very High']
df['income_bracket'] = pd.cut(df['annual_income'], bins=bins, labels=labels)
```
### 3.5.3 Encoding Categorical Variables
| Encoding | Suitable For | Example |
|----------|--------------|---------|
| One‑Hot | Nominal | `pd.get_dummies(df['color'])` |
| Ordinal | Ordered | `df['size'] = df['size'].map({'S':1,'M':2,'L':3})` |
| Target | High cardinality | `TargetEncoder` from the `category_encoders` library |
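A minimal one-hot sketch (the `color` column is illustrative): `pd.get_dummies` expands a nominal column into one indicator column per category, which you then concatenate back in place of the original:

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'red']})
dummies = pd.get_dummies(colors['color'], prefix='color')
encoded = pd.concat([colors.drop(columns='color'), dummies], axis=1)
print(encoded.columns.tolist())
```

For linear models, consider `drop_first=True` to avoid perfectly collinear indicator columns.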
### 3.5.4 Interaction Features
```python
# Interaction between price and quantity
df['price_quantity'] = df['price'] * df['quantity']
```
---
## 3.6 Dimensionality Reduction
High‑dimensional data can inflate variance, overfit models, and increase computational cost. Dimensionality reduction balances information retention with model efficiency.
### 3.6.1 Principal Component Analysis (PCA)
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df.select_dtypes(include=[np.number])
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # retain 95% of variance
X_pca = pca.fit_transform(X_scaled)
print(f'Reduced from {X.shape[1]} to {X_pca.shape[1]} dimensions')
```
### 3.6.2 t‑Distributed Stochastic Neighbor Embedding (t‑SNE)
> Use t‑SNE for **visualization** of high‑dimensional data only; it learns no reusable mapping for new points and scales poorly, so it is not suitable as a preprocessing step for downstream modeling.
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
```
### 3.6.3 Feature Selection Techniques
| Technique | When to Use | Example |
|-----------|-------------|---------|
| **Filter** (e.g., chi‑square) | Quick, model‑agnostic | `SelectKBest(chi2, k=10)` |
| **Wrapper** (e.g., Recursive Feature Elimination) | Model‑dependent | `RFE(estimator=RandomForestClassifier(), n_features_to_select=5)` |
| **Embedded** (e.g., Lasso) | Regularization‑based | `Lasso(alpha=0.01).fit(X, y)` |
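As a sketch of the filter approach on synthetic data (feature 0 is deliberately constructed to drive the target; note that chi-square requires non-negative features):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
X = rng.integers(0, 10, size=(100, 6)).astype(float)  # non-negative counts
y = (X[:, 0] >= 5).astype(int)  # target depends only on feature 0

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features
```

Filter methods like this are fast and model-agnostic, but they score features one at a time and can miss interactions that wrapper or embedded methods would catch.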
---
## 3.7 Pipeline Integration
Automating cleaning steps ensures reproducibility and scalability.
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
])
X_processed = pipeline.fit_transform(df.select_dtypes(include=[np.number]))
```
> **Best Practice:** Store the fitted pipeline (e.g., with `joblib`) and reuse it on production data so training and serving apply identical transformations.
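Persisting and reloading a fitted pipeline might look like this self-contained sketch (the file name and toy data are illustrative):

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Fit a small pipeline on toy training data (one missing value)
X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])
pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scaler', StandardScaler())])
pipe.fit(X_train)

# Persist the fitted pipeline, then reload it as a scoring job would
joblib.dump(pipe, 'cleaning_pipeline.joblib')
restored = joblib.load('cleaning_pipeline.joblib')
print(restored.transform(np.array([[2.0, np.nan]])))
```

Because the restored object carries the fitted medians and scaling statistics, production data is transformed exactly as the training data was.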
---
## 3.8 Summary
* Cleaning is an iterative, context‑driven process that demands both statistical insight and domain knowledge.
* Missing values, outliers, and inconsistencies should be identified early and handled with methods that preserve data integrity.
* Feature engineering and dimensionality reduction turn raw observations into robust signals, improving model performance and interpretability.
* Automating the pipeline safeguards consistency across experiments and production deployments.
By mastering these techniques, you lay a strong foundation for reliable analytics, trustworthy models, and actionable business insights.