Beyond the Numbers: A Modern Analyst's Guide to AI-Enhanced Finance – Chapter 3
Published 2026-03-03 12:33
# Chapter 3: Feature Engineering – Turning Raw Numbers into Predictive Signals
In the previous chapter we laid out the data ingestion pipeline and the safety nets that protect our models from data drift. Now we pivot to the heart of the AI‑driven analyst: turning raw market feeds into signals that a machine can understand and a human can trust.
## 1. Why Feature Engineering Still Matters
Machine learning is often hailed as a *feature‑less* oracle, but in practice the quality of the input space can make or break a portfolio. A well‑crafted feature set:
1. **Reduces dimensionality** – fewer, more meaningful columns keep models interpretable and efficient.
2. **Encodes domain knowledge** – the intuition of a seasoned analyst lives in engineered metrics.
3. **Improves generalisation** – well‑behaved features guard against over‑fitting to idiosyncratic market quirks.
4. **Facilitates downstream optimisation** – risk‑adjusted metrics and explainable AI demand clean, normalized inputs.
## 2. Data Sources at the Feature Level
| Source | Typical Features | Example |
|--------|------------------|---------|
| Market tick | Open, high, low, close, volume, bid-ask spread | `mid_price = (bid + ask) / 2` |
| Fundamentals | Revenue, earnings, P/E ratio, ROE | `earnings_per_share = earnings / shares_outstanding` |
| Macroeconomic | GDP growth, CPI, interest rates | `inflation_rate = (CPI_t - CPI_{t-1}) / CPI_{t-1}` |
| Alternative | News sentiment, social-media mentions | `sentiment_score = vader(text)` |
Collecting these data streams is only the first step; the transformation into *predictive* features is where artistry meets rigor.
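As a quick illustration of the formulas in the table, here is a minimal sketch using hypothetical `ticks` and `cpi` data (the names and values are invented for the example):

```python
import pandas as pd

# Hypothetical tick data: mid-price from the table's formula
ticks = pd.DataFrame({"bid": [99.5, 100.0], "ask": [100.5, 101.0]})
ticks["mid_price"] = (ticks["bid"] + ticks["ask"]) / 2

# Hypothetical CPI series: pct_change implements (CPI_t - CPI_{t-1}) / CPI_{t-1}
cpi = pd.Series([100.0, 102.0, 103.02])
inflation_rate = cpi.pct_change()
```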
## 3. Building Technical Features
Technical indicators distill price patterns into compact signals. Below are core families and the logic behind them.
### 3.1 Momentum Indicators
Momentum captures price acceleration and is often a proxy for trend strength.
```python
# Simple moving-average crossover: positive when the short average
# sits above the long one, signalling upward momentum
sma_short = price.rolling(window=20).mean()
sma_long = price.rolling(window=50).mean()
feature_mom = sma_short - sma_long
```
### 3.2 Volatility Indicators
Volatility signals the intensity of price swings.
```python
# Bollinger Bands: a 20-period moving average bracketed by ±2 rolling
# standard deviations (the mean and std windows must match)
sma_20 = price.rolling(window=20).mean()
rolling_std = price.rolling(window=20).std()
upper_band = sma_20 + 2 * rolling_std
lower_band = sma_20 - 2 * rolling_std
feature_vol = (price - sma_20) / rolling_std
```
### 3.3 Oscillators
Oscillators attempt to identify over‑bought/over‑sold states.
```python
# Relative Strength Index (RSI), here with simple rolling means
# (Cutler's variant; Wilder's original uses a smoothed moving average)
up = price.diff().clip(lower=0)
down = -price.diff().clip(upper=0)
avg_gain = up.rolling(window=14).mean()
avg_loss = down.rolling(window=14).mean()
rsi = 100 - (100 / (1 + avg_gain / avg_loss))
feature_osc = rsi
```
## 4. Textual Features from News and Social Media
Unstructured text offers a wealth of sentiment and event data. Two standard pipelines are:
1. **Tokenisation and Vectorisation** – bag‑of‑words, TF‑IDF, or embeddings.
2. **Sentiment Scoring** – VADER, TextBlob, or fine‑tuned transformer classifiers.
```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
news_df['sentiment'] = news_df['headline'].apply(
    lambda x: analyzer.polarity_scores(x)['compound']
)
```
After deriving a daily sentiment index, we can lag‑shift or compute rolling aggregates to match the temporal resolution of price data.
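The aggregation and lag step can be sketched as follows, using a hypothetical `scores` frame of timestamped per-headline sentiment (the data and column names are illustrative):

```python
import pandas as pd

# Hypothetical per-headline sentiment scores with timestamps
scores = pd.DataFrame(
    {"sentiment": [0.4, -0.2, 0.6, 0.1]},
    index=pd.to_datetime(["2026-01-05 09:00", "2026-01-05 15:00",
                          "2026-01-06 10:00", "2026-01-06 16:00"]),
)

# Aggregate to a daily index, then lag by one day so the feature for
# day t only uses headlines published before day t
daily = scores["sentiment"].resample("D").mean()
daily_lagged = daily.shift(1)

# Rolling aggregate to smooth out single noisy news days
rolling_3d = daily.rolling(window=3, min_periods=1).mean()
```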
## 5. Macro‑Economic and Fundamental Features
Macro variables help anchor model predictions to the broader economy.
- **Lagged GDP growth**:

```python
macro_df['gdp_lag1'] = macro_df['gdp'].shift(1)
```
- **Composite momentum of fundamentals** – ratio of current to lagged EPS:

```python
fund_df['eps_mom'] = fund_df['eps'] / fund_df['eps'].shift(1)
```
Incorporating these features can capture the *earnings‑growth* trade‑off that traditional factor models emphasize.
## 6. Feature Normalisation and Scaling
Neural networks and distance‑based models are sensitive to the scale of inputs. Common strategies:
- **Standardisation** (`z-score`) – zero mean, unit variance.
- **Robust Scaling** – using median and IQR to mitigate outliers.
- **Log‑Transform** – stabilises heavy‑tailed distributions.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
```
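The other two strategies can be sketched on a hypothetical heavy-tailed `volume` array (the values are invented to show the effect of a single outlier):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical volume series with one extreme outlier
volume = np.array([[100.0], [120.0], [110.0], [5000.0]])

# Robust scaling: (x - median) / IQR, so the outlier barely shifts
# the bulk of the distribution
robust = RobustScaler().fit_transform(volume)

# Log transform compresses the heavy tail before any further scaling;
# log1p handles zeros gracefully
log_volume = np.log1p(volume)
```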
## 7. Feature Selection: From Correlation to Tree‑Based Importance
Over‑engineering can drown a model in noise. Two pragmatic approaches:
1. **Correlation Matrix + VIF** – removes multicollinearity.
2. **Tree‑based Importance** – XGBoost or RandomForest scores each feature.
```python
import xgboost as xgb

model = xgb.XGBRegressor()
model.fit(X_train, y_train)
importances = model.feature_importances_
```
Plot the importances and prune the lowest-ranked features (for example, the bottom 20%).
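The VIF route can be sketched without any extra dependencies: regress each feature on the others and compute VIF_j = 1 / (1 - R_j²). The `feats` frame below is hypothetical, built so that `b` is nearly a copy of `a`:

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: VIF_j = 1 / (1 - R_j^2),
    where R_j^2 comes from regressing column j on the other columns."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy()
        X = df.drop(columns=col).to_numpy()
        X = np.column_stack([np.ones(len(X)), X])   # add intercept
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Hypothetical features: b is almost a copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
feats = pd.DataFrame({"a": a,
                      "b": a + 0.01 * rng.normal(size=500),
                      "c": rng.normal(size=500)})
scores = vif(feats)   # a and b get large VIFs; c stays near 1
```

Features with a VIF above a chosen threshold (commonly 5 or 10) are candidates for removal.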
## 8. Handling Missing Data and Outliers
Missing values are inevitable with multiple data feeds. Strategies include:
- **Imputation** – forward/backward fill for time series, mean/median for static fields.
- **Flagging** – create binary indicators for missingness.
- **Outlier treatment** – Winsorisation or clipping at 1.5 × IQR.
```python
# Record missingness *before* filling, otherwise the indicator is all
# zeros; .ffill() replaces the deprecated fillna(method='ffill')
features['price_missing'] = features['price'].isna().astype(int)
features['price'] = features['price'].ffill()
```
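The 1.5 × IQR clipping mentioned above can be sketched as a small helper; the `returns` series is hypothetical, with one fat-finger outlier:

```python
import pandas as pd

def clip_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical daily returns; 0.9 is a fat-finger outlier
returns = pd.Series([0.01, -0.02, 0.015, -0.01, 0.9])
clipped = clip_iqr(returns)
```

Winsorisation works the same way but with percentile cutoffs (e.g. the 1st and 99th) instead of IQR multiples.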
## 9. End‑to‑End Feature Pipeline Example
```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor

# Assume price_df, news_df, macro_df already loaded
features = price_df.join(news_df['sentiment']).join(macro_df)
numeric_features = ['open', 'high', 'low', 'close', 'volume', 'sentiment', 'gdp']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
preprocessor = ColumnTransformer([('num', numeric_transformer, numeric_features)])

model = Pipeline([
    ('preprocess', preprocessor),
    ('regressor', XGBRegressor(n_estimators=300, learning_rate=0.05))
])
model.fit(X_train, y_train)
```
The pipeline guarantees reproducibility: every training and production run passes through the same preprocessing steps.
## 10. Real‑World Case Study: Predicting Equity Returns with AI
**Objective** – Forecast next‑month excess returns for a universe of S&P 500 stocks.
**Feature Set** – 120 features: 60 technical indicators, 20 macro variables, 20 fundamental ratios, 10 sentiment scores, 10 lagged returns.
**Model** – Gradient‑Boosting Regressor tuned via Bayesian optimisation.
**Outcome** – Annualised Sharpe ratio of 1.45 against a buy‑and‑hold baseline of 0.85, with 70 % of the predictive power attributed to macro‑sentiment features.
**Lessons** –
1. **Signal decay** – Momentum features lose power after ~3 months.
2. **Data freshness** – Macro releases lag; incorporating high‑frequency proxy signals (e.g., credit‑spread curves) improved responsiveness.
3. **Feature drift** – Re‑training every quarter mitigated the degradation seen after the 2019‑2020 period.
## 11. Ethical and Practical Pitfalls
- **Data Snooping** – Exhaustive feature search can inadvertently cherry‑pick noise.
- **Over‑fitting to Regime‑Specific Events** – Models trained on a crisis period may fail in tranquil markets.
- **Transparency** – When deploying features derived from proprietary feeds, audit trails become essential.
## 12. Takeaway
Feature engineering is the *bridge* between raw data and the abstract world of AI. In finance, where the stakes are high and markets evolve, a disciplined yet creative approach to feature construction yields models that are robust, explainable, and ultimately profitable.
Next chapter, we will turn our focus to model training: how to build, validate, and deploy these engineered features into production pipelines that respect latency constraints and regulatory oversight.