Beyond the Numbers: A Modern Analyst's Guide to AI-Enhanced Finance – Chapter 3
Published 2026-03-03 12:33
# Chapter 3: Feature Engineering – Turning Raw Numbers into Predictive Signals
In the previous chapter we laid out the data ingestion pipeline and the safety nets that protect our models from data drift. Now we pivot to the heart of the AI‑driven analyst: turning raw market feeds into signals that a machine can understand and a human can trust.
## 1. Why Feature Engineering Still Matters
Machine learning is often hailed as a *feature‑less* oracle, but in practice the quality of the input space can make or break a portfolio. A well‑crafted feature set:
1. **Reduces dimensionality** – fewer, more meaningful columns keep models interpretable and efficient.
2. **Encodes domain knowledge** – the intuition of a seasoned analyst lives in engineered metrics.
3. **Improves generalisation** – well‑behaved features guard against over‑fitting to idiosyncratic market quirks.
4. **Facilitates downstream optimisation** – risk‑adjusted metrics and explainable AI demand clean, normalized inputs.
## 2. Data Sources at the Feature Level
| Source | Typical Features | Example |
|--------|------------------|---------|
| Market tick | Open, high, low, close, volume, bid-ask spread | `mid_price = (bid + ask) / 2` |
| Fundamentals | Revenue, earnings, P/E ratio, ROE | `earnings_per_share = earnings / shares_outstanding` |
| Macroeconomic | GDP growth, CPI, interest rates | `inflation_rate = (CPI_t - CPI_{t-1}) / CPI_{t-1}` |
| Alternative | News sentiment, social-media mentions | `sentiment_score = vader(text)` |
Collecting these data streams is only the first step; the transformation into *predictive* features is where artistry meets rigor.
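As a quick illustration of the formulas in the table, here is a minimal sketch using hypothetical `ticks` and `cpi` data (the names and values are invented for the example):

```python
import pandas as pd

# Hypothetical tick data: mid-price from the table's formula
ticks = pd.DataFrame({"bid": [99.5, 100.0], "ask": [100.5, 101.0]})
ticks["mid_price"] = (ticks["bid"] + ticks["ask"]) / 2

# Hypothetical CPI series: pct_change implements (CPI_t - CPI_{t-1}) / CPI_{t-1}
cpi = pd.Series([100.0, 102.0, 103.02])
inflation_rate = cpi.pct_change()
```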
## 3. Building Technical Features
Technical indicators distill price patterns into compact signals. Below are core families and the logic behind them.
### 3.1 Momentum Indicators
Momentum captures price acceleration and is often a proxy for trend strength.
```python
# Simple moving-average crossover: positive when the short average
# sits above the long one, signalling upward momentum
sma_short = price.rolling(window=20).mean()
sma_long = price.rolling(window=50).mean()
feature_mom = sma_short - sma_long
```
### 3.2 Volatility Indicators
Volatility signals the intensity of price swings.
```python
# Bollinger Bands: a 20-period moving average bracketed by ±2 rolling
# standard deviations (the mean and std windows must match)
sma_20 = price.rolling(window=20).mean()
rolling_std = price.rolling(window=20).std()
upper_band = sma_20 + 2 * rolling_std
lower_band = sma_20 - 2 * rolling_std
feature_vol = (price - sma_20) / rolling_std
```
### 3.3 Oscillators
Oscillators attempt to identify over‑bought/over‑sold states.
```python
# Relative Strength Index (RSI), here with simple rolling means
# (Cutler's variant; Wilder's original uses a smoothed moving average)
up = price.diff().clip(lower=0)
down = -price.diff().clip(upper=0)
avg_gain = up.rolling(window=14).mean()
avg_loss = down.rolling(window=14).mean()
rsi = 100 - (100 / (1 + avg_gain / avg_loss))
feature_osc = rsi
```
## 4. Textual Features from News and Social Media
Unstructured text offers a wealth of sentiment and event data. Two standard pipelines are:
1. **Tokenisation and Vectorisation** – bag‑of‑words, TF‑IDF, or embeddings.
2. **Sentiment Scoring** – VADER, TextBlob, or fine‑tuned transformer classifiers.
```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
news_df['sentiment'] = news_df['headline'].apply(
    lambda x: analyzer.polarity_scores(x)['compound']
)
```
After deriving a daily sentiment index, we can lag‑shift or compute rolling aggregates to match the temporal resolution of price data.
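The aggregation and lag step can be sketched as follows, using a hypothetical `scores` frame of timestamped per-headline sentiment (the data and column names are illustrative):

```python
import pandas as pd

# Hypothetical per-headline sentiment scores with timestamps
scores = pd.DataFrame(
    {"sentiment": [0.4, -0.2, 0.6, 0.1]},
    index=pd.to_datetime(["2026-01-05 09:00", "2026-01-05 15:00",
                          "2026-01-06 10:00", "2026-01-06 16:00"]),
)

# Aggregate to a daily index, then lag by one day so the feature for
# day t only uses headlines published before day t
daily = scores["sentiment"].resample("D").mean()
daily_lagged = daily.shift(1)

# Rolling aggregate to smooth out single noisy news days
rolling_3d = daily.rolling(window=3, min_periods=1).mean()
```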
## 5. Macro‑Economic and Fundamental Features
Macro variables help anchor model predictions to the broader economy.
- **Lagged GDP growth**:

```python
macro_df['gdp_lag1'] = macro_df['gdp'].shift(1)
```
- **Composite momentum of fundamentals** – ratio of current to lagged EPS:

```python
fund_df['eps_mom'] = fund_df['eps'] / fund_df['eps'].shift(1)
```
Incorporating these features can capture the *earnings‑growth* trade‑off that traditional factor models emphasize.
## 6. Feature Normalisation and Scaling
Neural networks and distance‑based models are sensitive to the scale of inputs. Common strategies:
- **Standardisation** (`z-score`) – zero mean, unit variance.
- **Robust Scaling** – using median and IQR to mitigate outliers.
- **Log‑Transform** – stabilises heavy‑tailed distributions.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
```
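The other two strategies can be sketched on a hypothetical heavy-tailed `volume` array (the values are invented to show the effect of a single outlier):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical volume series with one extreme outlier
volume = np.array([[100.0], [120.0], [110.0], [5000.0]])

# Robust scaling: (x - median) / IQR, so the outlier barely shifts
# the bulk of the distribution
robust = RobustScaler().fit_transform(volume)

# Log transform compresses the heavy tail before any further scaling;
# log1p handles zeros gracefully
log_volume = np.log1p(volume)
```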
## 7. Feature Selection: From Correlation to Tree‑Based Importance
Over‑engineering can drown a model in noise. Two pragmatic approaches:
1. **Correlation Matrix + VIF** – removes multicollinearity.
2. **Tree‑based Importance** – XGBoost or RandomForest scores each feature.
```python
import xgboost as xgb

model = xgb.XGBRegressor()
model.fit(X_train, y_train)
importances = model.feature_importances_
```
Plot the importances and prune the lowest-ranked features (for example, the bottom 20%).
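The VIF route can be sketched without any extra dependencies: regress each feature on the others and compute VIF_j = 1 / (1 - R_j²). The `feats` frame below is hypothetical, built so that `b` is nearly a copy of `a`:

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: VIF_j = 1 / (1 - R_j^2),
    where R_j^2 comes from regressing column j on the other columns."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy()
        X = df.drop(columns=col).to_numpy()
        X = np.column_stack([np.ones(len(X)), X])   # add intercept
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Hypothetical features: b is almost a copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
feats = pd.DataFrame({"a": a,
                      "b": a + 0.01 * rng.normal(size=500),
                      "c": rng.normal(size=500)})
scores = vif(feats)   # a and b get large VIFs; c stays near 1
```

Features with a VIF above a chosen threshold (commonly 5 or 10) are candidates for removal.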
## 8. Handling Missing Data and Outliers
Missing values are inevitable with multiple data feeds. Strategies include:
- **Imputation** – forward/backward fill for time series, mean/median for static fields.
- **Flagging** – create binary indicators for missingness.
- **Outlier treatment** – Winsorisation or clipping at 1.5 × IQR.
```python
# Record missingness *before* filling, otherwise the indicator is all
# zeros; .ffill() replaces the deprecated fillna(method='ffill')
features['price_missing'] = features['price'].isna().astype(int)
features['price'] = features['price'].ffill()
```
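The 1.5 × IQR clipping mentioned above can be sketched as a small helper; the `returns` series is hypothetical, with one fat-finger outlier:

```python
import pandas as pd

def clip_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical daily returns; 0.9 is a fat-finger outlier
returns = pd.Series([0.01, -0.02, 0.015, -0.01, 0.9])
clipped = clip_iqr(returns)
```

Winsorisation works the same way but with percentile cutoffs (e.g. the 1st and 99th) instead of IQR multiples.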
## 9. End‑to‑End Feature Pipeline Example
```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor

# Assume price_df, news_df, macro_df already loaded
features = price_df.join(news_df['sentiment']).join(macro_df)
numeric_features = ['open', 'high', 'low', 'close', 'volume', 'sentiment', 'gdp']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
preprocessor = ColumnTransformer([('num', numeric_transformer, numeric_features)])

model = Pipeline([
    ('preprocess', preprocessor),
    ('regressor', XGBRegressor(n_estimators=300, learning_rate=0.05))
])
model.fit(X_train, y_train)
```
The pipeline guarantees reproducibility: every training and production run passes through the same preprocessing steps.
## 10. Real‑World Case Study: Predicting Equity Returns with AI
**Objective** – Forecast next‑month excess returns for a universe of S&P 500 stocks.
**Feature Set** – 120 features: 60 technical indicators, 20 macro variables, 20 fundamental ratios, 10 sentiment scores, 10 lagged returns.
**Model** – Gradient‑Boosting Regressor tuned via Bayesian optimisation.
**Outcome** – Annualised Sharpe ratio of 1.45 against a buy‑and‑hold baseline of 0.85, with 70 % of the predictive power attributed to macro‑sentiment features.
**Lessons** –
1. **Signal decay** – Momentum features lose power after ~3 months.
2. **Data freshness** – Macro releases lag; incorporating high‑frequency proxy signals (e.g., credit‑spread curves) improved responsiveness.
3. **Feature drift** – Re‑training every quarter mitigated the degradation seen after the 2019‑2020 period.
## 11. Ethical and Practical Pitfalls
- **Data Snooping** – Exhaustive feature search can inadvertently cherry‑pick noise.
- **Over‑fitting to Regime‑Specific Events** – Models trained on a crisis period may fail in tranquil markets.
- **Transparency** – When deploying features derived from proprietary feeds, audit trails become essential.
## 12. Takeaway
Feature engineering is the *bridge* between raw data and the abstract world of AI. In finance, where the stakes are high and markets evolve, a disciplined yet creative approach to feature construction yields models that are robust, explainable, and ultimately profitable.
Next chapter, we will turn our focus to model training: how to build, validate, and deploy these engineered features into production pipelines that respect latency constraints and regulatory oversight.