Data Science Unlocked: A Practical Guide for Modern Analysts - Chapter 7
Published 2026-02-23 16:57
# Chapter 7: Advanced Topics – Deep Learning & NLP
Deep learning has become the engine behind the most sophisticated analytics in finance, healthcare, marketing, and beyond. This chapter delivers a hands‑on yet principled tour of neural network architectures—CNNs, RNNs, LSTMs, and transformers—and shows how to embed them into end‑to‑end NLP pipelines. The focus is on *what works*, *why it works*, and *how to implement it reliably*.
---
## 7.1 Neural‑Network Foundations
| Concept | Definition | Typical Use‑Case |
|---------|------------|------------------|
| **Perceptron** | A single‑layer linear classifier with an activation function. | Binary classification of linearly separable data. |
| **Multilayer Perceptron (MLP)** | Feed‑forward network with one or more hidden layers. | Tabular regression and classification. |
| **Activation** | Non‑linear function applied to layer output (ReLU, sigmoid, tanh). | Enables networks to model complex patterns. |
| **Loss Function** | Measures prediction error (MSE, Cross‑Entropy). | Drives weight updates during training. |
| **Back‑Propagation** | Gradient descent over network parameters. | Optimizes weights to minimize loss. |
| **Optimizer** | Algorithm to update weights (SGD, Adam, RMSProp). | Controls convergence speed and stability. |
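To make loss, gradients, and optimizer updates concrete before reaching for a framework, here is a hand-rolled gradient-descent sketch that fits a single linear neuron with MSE loss; the data and learning rate are illustrative.

```python
# Fit y = 2x with one weight via manual gradient descent (MSE loss)
w = 0.0          # initial weight
lr = 0.1         # learning rate
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs

for step in range(100):
    # Accumulate d(MSE)/dw over the dataset: for one sample, 2 * (y_hat - t) * x
    grad = 0.0
    for x, t in data:
        y_hat = w * x
        grad += 2 * (y_hat - t) * x
    grad /= len(data)
    # Gradient-descent update — exactly what an optimizer automates
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

Back-propagation is this same chain-rule bookkeeping, applied layer by layer through the network.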
### Minimal MLP Example (PyTorch)
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Simple MLP for the XOR problem
class XORNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 8), nn.ReLU(),
            nn.Linear(8, 1), nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

model = XORNet()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training data: the four XOR input/target pairs
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# Training loop
for epoch in range(5000):
    optimizer.zero_grad()
    output = model(X)
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()
    if epoch % 500 == 0:
        print(f"Epoch {epoch}: loss={loss.item():.4f}")
```
*Key takeaway:* Even a tiny network can learn non‑linear decision boundaries when combined with a suitable activation and optimizer.
---
## 7.2 Convolutional Neural Networks (CNNs)
CNNs exploit spatial locality by applying learnable filters across an input grid. They are the backbone of image classification, object detection, and increasingly, tabular data that can be reshaped into images.
| Layer | Purpose | Typical Parameters |
|-------|---------|--------------------|
| **Conv2D** | Feature extraction via learned kernels | `kernel_size`, `stride`, `padding` |
| **ReLU** | Non‑linearity | – |
| **MaxPool** | Down‑sampling & invariance | `pool_size`, `stride` |
| **Flatten** | Transition to dense layers | – |
| **Dense** | Classification | `units`, `activation` |
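A frequent source of shape bugs is mis-predicting the spatial output size of a conv or pooling layer. The standard formula, `floor((W - K + 2P) / S) + 1`, can be sketched as a helper (the function name is illustrative):

```python
def conv_output_size(w, k, s=1, p=0):
    """Output size of a Conv2D/MaxPool layer along one spatial dimension.

    w: input size, k: kernel_size, s: stride, p: padding.
    """
    return (w - k + 2 * p) // s + 1

# 32x32 CIFAR image, 3x3 kernel, stride 1, padding 1 -> size preserved
print(conv_output_size(32, 3, s=1, p=1))  # 32
# Followed by a 2x2 max-pool with stride 2 -> halved
print(conv_output_size(32, 2, s=2))       # 16
```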
### Example: CIFAR‑10 Classification with Transfer Learning
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models

# Data loaders (resize CIFAR-10 images to the ImageNet input size)
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
train_loader = torch.utils.data.DataLoader(
    datasets.CIFAR10(root='data', train=True, download=True, transform=transform),
    batch_size=32, shuffle=True)

# Load pre-trained ResNet18 (torchvision >= 0.13 uses `weights` instead of `pretrained`)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the final layer for 10 classes
model.fc = nn.Linear(model.fc.in_features, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# One epoch of fine-tuning
model.train()
for imgs, labels in train_loader:
    optimizer.zero_grad()
    logits = model(imgs)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
```
**Practical Tips**
- **Data Augmentation** (random crops, flips) increases robustness.
- **Freeze early layers** to reduce training time.
- **Learning‑rate scheduling** (e.g., cosine decay) helps convergence.
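The freezing tip above amounts to turning off gradients for the backbone and leaving only the new head trainable. A minimal sketch with a stand-in `nn.Sequential` (the real ResNet case is identical — freeze `model.parameters()`, then unfreeze `model.fc.parameters()`):

```python
import torch.nn as nn

# Stand-in for a pre-trained backbone plus a freshly attached head (illustrative)
model = nn.Sequential(
    nn.Linear(16, 8), nn.ReLU(),   # "early" layers: freeze these
    nn.Linear(8, 2),               # new classification head: keep trainable
)

# Freeze everything, then selectively unfreeze the head
for param in model.parameters():
    param.requires_grad = False
for param in model[2].parameters():
    param.requires_grad = True

print([n for n, p in model.named_parameters() if p.requires_grad])
# ['2.weight', '2.bias']
```

Only the unfrozen parameters receive gradient updates, which cuts both compute and the risk of destroying pre-trained features.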
---
## 7.3 Recurrent Neural Networks (RNNs) & LSTMs
RNNs process sequences by maintaining a hidden state that captures context. Standard RNNs suffer from vanishing/exploding gradients; gated units like LSTM and GRU mitigate this.
| Unit | Advantage | Typical Use‑Case |
|------|-----------|------------------|
| **RNN** | Simple sequence modeling | Small‑scale time‑series |
| **LSTM** | Handles long‑range dependencies | Sentiment analysis, speech recognition |
| **GRU** | Fewer parameters than LSTM | Real‑time translation |
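The parameter trade-off in the table can be checked directly: for the same input and hidden sizes, an LSTM carries four gate-sized weight blocks to the GRU's three and the vanilla RNN's one (the sizes below are illustrative).

```python
import torch.nn as nn

def n_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

x, h = 128, 128  # input size, hidden size
rnn = nn.RNN(x, h)
gru = nn.GRU(x, h)
lstm = nn.LSTM(x, h)

# LSTM = 4 gate blocks, GRU = 3, vanilla RNN = 1
print(n_params(rnn), n_params(gru), n_params(lstm))
```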
### Sentiment Analysis on IMDB (Keras)
```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 20000
maxlen = 500

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential([
    Embedding(vocab_size, 128, input_length=maxlen),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, batch_size=64, validation_split=0.2)
print('Test accuracy:', model.evaluate(x_test, y_test)[1])
```
*Key takeaway:* LSTMs can capture sentiment cues that depend on context across the entire review.
---
## 7.4 Transformers & Attention
Transformers discard recurrence in favor of *self‑attention*, enabling parallel training and superior context modeling. The original architecture pairs an encoder with a decoder; BERT keeps only the encoder (trained with masked language modeling), while GPT keeps only the decoder (trained for autoregressive generation).
| Model | Architecture | Core Idea |
|-------|--------------|-----------|
| **BERT** | Encoder‑only, bidirectional | Predict masked tokens |
| **GPT‑2/3** | Decoder‑only, autoregressive | Predict next token |
| **T5** | Encoder‑decoder, text‑to‑text | Unified text generation |
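All three models in the table share the same core computation: scaled dot-product attention, `Attention(Q, K, V) = softmax(QKᵀ / √d_k) V`. A minimal NumPy sketch (single head, no masking; the toy matrices are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 3 tokens, model dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (3, 4)
print(attn.sum(axis=-1))      # each row of attention weights sums to 1
```

The `1/√d_k` scaling keeps the dot products from saturating the softmax as the model dimension grows.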
### Fine‑Tuning BERT for Sentiment Classification
```python
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf

# Load tokenizer & model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Dummy data
texts = ['I love this movie', 'This film is terrible']
labels = tf.constant([1, 0])

# Tokenization
inputs = tokenizer(texts, return_tensors='tf', padding=True, truncation=True)

# Training step
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        logits = model(**inputs).logits
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

print('Initial loss:', train_step().numpy())
```
**Why Transformers?**
- *Parallelism*: all token positions are processed simultaneously, so training maps efficiently onto GPUs/TPUs — unlike RNNs, which must step through the sequence.
- *Context depth*: attention weights capture relations across tokens regardless of distance.
- *Pre‑training*: massive corpora yield transferable knowledge.
---
## 7.5 End‑to‑End NLP Pipeline: From Text to Decision
Below is a consolidated flow that takes raw text, transforms it into embeddings, processes it with a transformer, and produces a probabilistic label.
```
Raw Text ➜ Tokenizer ➜ Input IDs & Attention Mask
        │
        ▼
Transformer Encoder ➜ Contextual Embeddings
        │
        ▼
Classification Head (Dense + Softmax)
        │
        ▼
Probability Distribution p(label)
```
### Pipeline Skeleton (Python)
```python
import torch

# Assumes a PyTorch model/tokenizer pair (e.g. BertForSequenceClassification
# with its tokenizer) and `texts`, a list of raw input strings

# 1) Tokenization
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# 2) Forward pass
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)

# 3) Decision rule
predicted_labels = torch.argmax(probs, dim=-1)
```
**Batching Strategies**
| Strategy | When to Use |
|----------|-------------|
| **Padding to longest in batch** | GPU memory is abundant |
| **Dynamic padding** (bucketing) | Low‑latency inference |
| **Streaming inference** | Real‑time chatbots |
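Dynamic padding with bucketing can be sketched in plain Python: sort sequences by length so similar lengths land in the same batch, then pad each batch only to its own longest member (the `pad_id` and `batch_size` values are illustrative).

```python
def dynamic_batches(sequences, batch_size=2, pad_id=0):
    """Yield batches padded only to the longest sequence in each batch."""
    ordered = sorted(sequences, key=len)  # bucketing: similar lengths group together
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        width = max(len(s) for s in batch)
        yield [s + [pad_id] * (width - len(s)) for s in batch]

seqs = [[1, 2, 3, 4, 5], [7, 8], [9], [1, 2, 3]]
for b in dynamic_batches(seqs):
    print(b)
# [[9, 0], [7, 8]]
# [[1, 2, 3, 0, 0], [1, 2, 3, 4, 5]]
```

Compared with padding every sequence to a global maximum, this wastes far fewer positions on pad tokens, which is what makes it attractive for low-latency inference.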
---
## 7.6 Interpreting and Debugging Neural Models
| Challenge | Diagnostic Tool | Action |
|-----------|-----------------|--------|
| **Gradient vanishing** | Inspect `grad_norm` | Use gated units or gradient clipping |
| **Over‑fitting** | Validation loss plateau | Early stopping or dropout |
| **Input noise** | Correlation matrix of embeddings | Augment data or add regularization |
| **Model bias** | Confusion matrix across demographic slices | Re‑balance training data or use counter‑factual examples |
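The gradient diagnostics in the first row can be sketched with PyTorch's built-in utility, which both reports the global gradient norm and clips it in one call (the tiny model and `max_norm=1.0` are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Returns the global norm *before* clipping, then rescales gradients in place
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"grad norm before clipping: {total_norm.item():.4f}")

# After clipping, the global gradient norm cannot exceed max_norm
clipped = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"grad norm after clipping:  {clipped.item():.4f}")
```

Logging `total_norm` each step is a cheap way to spot both vanishing (norm collapsing toward zero) and exploding (norm spiking) gradients.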
---
## 7.7 Bias, Interpretability & Privacy in NLP
| Issue | Mitigation | Tool |
|-------|------------|------|
| **Data Bias** | Balanced sampling, fairness constraints | `fairlearn` library |
| **Interpretability** | Attention visualisation, LIME | `transformers` with `output_attentions=True` |
| **Privacy** | Differential privacy during training | `tensorflow-privacy` |
| **Token‑level leaks** | Redact PII before tokenisation | regex + `spacy` |
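The token-level-leak row can be sketched with plain regexes; the pattern names below are illustrative, and real pipelines pair this with an NER model (e.g. spaCy's) to catch names and addresses that regexes miss.

```python
import re

# Illustrative PII patterns — extend per jurisdiction and data source
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact(text):
    """Replace each PII match with a bracketed placeholder before tokenisation."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].
```

Redacting *before* tokenisation matters: once PII enters the training corpus, the model can memorise and regurgitate it.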
```python
# Visualising BERT self-attention, averaged over heads in the last layer
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

text = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(text, return_tensors='pt')
output = model(**inputs)

# Last layer's attention: shape [heads, seq_len, seq_len]
attentions = output.attentions[-1].detach().numpy()[0]
avg_attn = attentions.mean(axis=0)  # average across heads

# Tick labels must be the model's WordPiece tokens (including [CLS]/[SEP]),
# not whitespace-split words, or they will be misaligned with the matrix
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
plt.imshow(avg_attn, cmap='viridis')
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title('BERT Self‑Attention')
plt.show()
```
Interpretability is often a *post‑hoc* requirement in regulated industries. Visualising attention or extracting token‑level importance helps stakeholders trust the model.
---
## 7.8 Summary & Next Steps
| Concept | Key Insight |
|---------|-------------|
| **MLPs** | Effective for structured tabular data when combined with proper regularisation. |
| **CNNs** | Leverage local spatial patterns; transfer learning is a practical productivity booster. |
| **RNNs / LSTMs** | Essential for sequential data where order matters; gated units solve long‑range dependency problems. |
| **Transformers** | Parallel, context‑rich modeling that dominates NLP benchmarks; fine‑tuning is straightforward with `transformers`. |
| **Pipeline** | Tokenise → Embed → Contextualise (attention) → Classify / Generate → Deploy. |
**Action items for the reader**
1. **Implement** a BERT‑based classifier on a real dataset (e.g., product reviews). Tune learning rate and batch size.
2. **Experiment** with a lightweight CNN on a custom tabular dataset reshaped into 2‑D arrays.
3. **Deploy** the trained model as a REST service using TensorFlow‑Serving or FastAPI.
4. **Audit** the model for bias by inspecting per‑class performance across user demographics.
In the next chapter, we will translate these trained models into *production‑ready* services that can be integrated with Spark pipelines, orchestrated via Airflow, and monitored with Prometheus + Grafana.