Data Science Unlocked: A Practical Guide for Modern Analysts - Chapter 7
Published 2026-02-23 16:57
# Chapter 7: Advanced Topics – Deep Learning & NLP
Deep learning has become the engine behind the most sophisticated analytics in finance, healthcare, marketing, and beyond. This chapter delivers a hands‑on yet principled tour of neural network architectures—CNNs, RNNs, LSTMs, and transformers—and shows how to embed them into end‑to‑end NLP pipelines. The focus is on *what works*, *why it works*, and *how to implement it reliably*.
---
## 7.1 Neural‑Network Foundations
| Concept | Definition | Typical Use‑Case |
|---------|------------|------------------|
| **Perceptron** | A single‑layer linear classifier with an activation function. | Binary classification of linearly separable data. |
| **Multilayer Perceptron (MLP)** | Feed‑forward network with one or more hidden layers. | Tabular regression and classification. |
| **Activation** | Non‑linear function applied to layer output (ReLU, sigmoid, tanh). | Enables networks to model complex patterns. |
| **Loss Function** | Measures prediction error (MSE, Cross‑Entropy). | Drives weight updates during training. |
| **Back‑Propagation** | Gradient descent over network parameters. | Optimizes weights to minimize loss. |
| **Optimizer** | Algorithm to update weights (SGD, Adam, RMSProp). | Controls convergence speed and stability. |
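To make loss, gradients, and optimizer updates concrete before reaching for a framework, here is a hand-rolled gradient-descent sketch that fits a single linear neuron with MSE loss; the data and learning rate are illustrative.

```python
# Fit y = 2x with one weight via manual gradient descent (MSE loss)
w = 0.0          # initial weight
lr = 0.1         # learning rate
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs

for step in range(100):
    # Accumulate d(MSE)/dw over the dataset: for one sample, 2 * (y_hat - t) * x
    grad = 0.0
    for x, t in data:
        y_hat = w * x
        grad += 2 * (y_hat - t) * x
    grad /= len(data)
    # Gradient-descent update — exactly what an optimizer automates
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

Back-propagation is this same chain-rule bookkeeping, applied layer by layer through the network.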
### Minimal MLP Example (PyTorch)
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Simple MLP for the XOR problem
class XORNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 8), nn.ReLU(),
            nn.Linear(8, 1), nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

model = XORNet()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training data: the four XOR input/target pairs
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# Training loop
for epoch in range(5000):
    optimizer.zero_grad()
    output = model(X)
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()
    if epoch % 500 == 0:
        print(f"Epoch {epoch}: loss={loss.item():.4f}")
```
*Key takeaway:* Even a tiny network can learn non‑linear decision boundaries when combined with a suitable activation and optimizer.
---
## 7.2 Convolutional Neural Networks (CNNs)
CNNs exploit spatial locality by applying learnable filters across an input grid. They are the backbone of image classification, object detection, and increasingly, tabular data that can be reshaped into images.
| Layer | Purpose | Typical Parameters |
|-------|---------|--------------------|
| **Conv2D** | Feature extraction via learned kernels | `kernel_size`, `stride`, `padding` |
| **ReLU** | Non‑linearity | – |
| **MaxPool** | Down‑sampling & invariance | `pool_size`, `stride` |
| **Flatten** | Transition to dense layers | – |
| **Dense** | Classification | `units`, `activation` |
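A frequent source of shape bugs is mis-predicting the spatial output size of a conv or pooling layer. The standard formula, `floor((W - K + 2P) / S) + 1`, can be sketched as a helper (the function name is illustrative):

```python
def conv_output_size(w, k, s=1, p=0):
    """Output size of a Conv2D/MaxPool layer along one spatial dimension.

    w: input size, k: kernel_size, s: stride, p: padding.
    """
    return (w - k + 2 * p) // s + 1

# 32x32 CIFAR image, 3x3 kernel, stride 1, padding 1 -> size preserved
print(conv_output_size(32, 3, s=1, p=1))  # 32
# Followed by a 2x2 max-pool with stride 2 -> halved
print(conv_output_size(32, 2, s=2))       # 16
```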
### Example: CIFAR‑10 Classification with Transfer Learning
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models

# Data loaders (resize CIFAR-10 images to the ImageNet input size)
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
train_loader = torch.utils.data.DataLoader(
    datasets.CIFAR10(root='data', train=True, download=True, transform=transform),
    batch_size=32, shuffle=True)

# Load pre-trained ResNet18 (torchvision >= 0.13 uses `weights` instead of `pretrained`)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the final layer for 10 classes
model.fc = nn.Linear(model.fc.in_features, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# One epoch of fine-tuning
model.train()
for imgs, labels in train_loader:
    optimizer.zero_grad()
    logits = model(imgs)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
```
**Practical Tips**
- **Data Augmentation** (random crops, flips) increases robustness.
- **Freeze early layers** to reduce training time.
- **Learning‑rate scheduling** (e.g., cosine decay) helps convergence.
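The freezing tip above amounts to turning off gradients for the backbone and leaving only the new head trainable. A minimal sketch with a stand-in `nn.Sequential` (the real ResNet case is identical — freeze `model.parameters()`, then unfreeze `model.fc.parameters()`):

```python
import torch.nn as nn

# Stand-in for a pre-trained backbone plus a freshly attached head (illustrative)
model = nn.Sequential(
    nn.Linear(16, 8), nn.ReLU(),   # "early" layers: freeze these
    nn.Linear(8, 2),               # new classification head: keep trainable
)

# Freeze everything, then selectively unfreeze the head
for param in model.parameters():
    param.requires_grad = False
for param in model[2].parameters():
    param.requires_grad = True

print([n for n, p in model.named_parameters() if p.requires_grad])
# ['2.weight', '2.bias']
```

Only the unfrozen parameters receive gradient updates, which cuts both compute and the risk of destroying pre-trained features.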
---
## 7.3 Recurrent Neural Networks (RNNs) & LSTMs
RNNs process sequences by maintaining a hidden state that captures context. Standard RNNs suffer from vanishing/exploding gradients; gated units like LSTM and GRU mitigate this.
| Unit | Advantage | Typical Use‑Case |
|------|-----------|------------------|
| **RNN** | Simple sequence modeling | Small‑scale time‑series |
| **LSTM** | Handles long‑range dependencies | Sentiment analysis, speech recognition |
| **GRU** | Fewer parameters than LSTM | Real‑time translation |
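The parameter trade-off in the table can be checked directly: for the same input and hidden sizes, an LSTM carries four gate-sized weight blocks to the GRU's three and the vanilla RNN's one (the sizes below are illustrative).

```python
import torch.nn as nn

def n_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

x, h = 128, 128  # input size, hidden size
rnn = nn.RNN(x, h)
gru = nn.GRU(x, h)
lstm = nn.LSTM(x, h)

# LSTM = 4 gate blocks, GRU = 3, vanilla RNN = 1
print(n_params(rnn), n_params(gru), n_params(lstm))
```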
### Sentiment Analysis on IMDB (Keras)
```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 20000
maxlen = 500

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential([
    Embedding(vocab_size, 128, input_length=maxlen),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, batch_size=64, validation_split=0.2)
print('Test accuracy:', model.evaluate(x_test, y_test)[1])
```
*Key takeaway:* LSTMs can capture sentiment cues that depend on context across the entire review.
---
## 7.4 Transformers & Attention
Transformers discard recurrence in favor of *self‑attention*, enabling parallel training and superior context modeling. The original architecture pairs an encoder with a decoder; BERT keeps only the encoder (trained with masked language modeling), while GPT keeps only the decoder (trained for autoregressive generation).
| Model | Architecture | Core Idea |
|-------|--------------|-----------|
| **BERT** | Encoder‑only, bidirectional | Predict masked tokens |
| **GPT‑2/3** | Decoder‑only, autoregressive | Predict next token |
| **T5** | Encoder‑decoder, text‑to‑text | Unified text generation |
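All three models in the table share the same core computation: scaled dot-product attention, `Attention(Q, K, V) = softmax(QKᵀ / √d_k) V`. A minimal NumPy sketch (single head, no masking; the toy matrices are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 3 tokens, model dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (3, 4)
print(attn.sum(axis=-1))      # each row of attention weights sums to 1
```

The `1/√d_k` scaling keeps the dot products from saturating the softmax as the model dimension grows.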
### Fine‑Tuning BERT for Sentiment Classification
```python
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf

# Load tokenizer & model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Dummy data
texts = ['I love this movie', 'This film is terrible']
labels = tf.constant([1, 0])

# Tokenization
inputs = tokenizer(texts, return_tensors='tf', padding=True, truncation=True)

# Training step
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        logits = model(**inputs).logits
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

print('Initial loss:', train_step().numpy())
```
**Why Transformers?**
- *Parallelism*: all token positions are processed simultaneously, so training maps efficiently onto GPUs/TPUs — unlike RNNs, which must step through the sequence.
- *Context depth*: attention weights capture relations across tokens regardless of distance.
- *Pre‑training*: massive corpora yield transferable knowledge.
---
## 7.5 End‑to‑End NLP Pipeline: From Text to Decision
Below is a consolidated flow that takes raw text, transforms it into embeddings, processes it with a transformer, and produces a probabilistic label.
```
Raw Text ➜ Tokenizer ➜ Input IDs & Attention Mask
        │
        ▼
Transformer Encoder ➜ Contextual Embeddings
        │
        ▼
Classification Head (Dense + Softmax)
        │
        ▼
Probability Distribution p(label)
```
### Pipeline Skeleton (Python)
```python
import torch

# Assumes a PyTorch model/tokenizer pair (e.g. BertForSequenceClassification
# with its tokenizer) and `texts`, a list of raw input strings

# 1) Tokenization
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# 2) Forward pass
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)

# 3) Decision rule
predicted_labels = torch.argmax(probs, dim=-1)
```
**Batching Strategies**
| Strategy | When to Use |
|----------|-------------|
| **Padding to longest in batch** | GPU memory is abundant |
| **Dynamic padding** (bucketing) | Low‑latency inference |
| **Streaming inference** | Real‑time chatbots |
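Dynamic padding with bucketing can be sketched in plain Python: sort sequences by length so similar lengths land in the same batch, then pad each batch only to its own longest member (the `pad_id` and `batch_size` values are illustrative).

```python
def dynamic_batches(sequences, batch_size=2, pad_id=0):
    """Yield batches padded only to the longest sequence in each batch."""
    ordered = sorted(sequences, key=len)  # bucketing: similar lengths group together
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        width = max(len(s) for s in batch)
        yield [s + [pad_id] * (width - len(s)) for s in batch]

seqs = [[1, 2, 3, 4, 5], [7, 8], [9], [1, 2, 3]]
for b in dynamic_batches(seqs):
    print(b)
# [[9, 0], [7, 8]]
# [[1, 2, 3, 0, 0], [1, 2, 3, 4, 5]]
```

Compared with padding every sequence to a global maximum, this wastes far fewer positions on pad tokens, which is what makes it attractive for low-latency inference.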
---
## 7.6 Interpreting and Debugging Neural Models
| Challenge | Diagnostic Tool | Action |
|-----------|-----------------|--------|
| **Gradient vanishing** | Inspect `grad_norm` | Use gated units or gradient clipping |
| **Over‑fitting** | Validation loss plateau | Early stopping or dropout |
| **Input noise** | Correlation matrix of embeddings | Augment data or add regularization |
| **Model bias** | Confusion matrix across demographic slices | Re‑balance training data or use counter‑factual examples |
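The gradient diagnostics in the first row can be sketched with PyTorch's built-in utility, which both reports the global gradient norm and clips it in one call (the tiny model and `max_norm=1.0` are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Returns the global norm *before* clipping, then rescales gradients in place
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"grad norm before clipping: {total_norm.item():.4f}")

# After clipping, the global gradient norm cannot exceed max_norm
clipped = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"grad norm after clipping:  {clipped.item():.4f}")
```

Logging `total_norm` each step is a cheap way to spot both vanishing (norm collapsing toward zero) and exploding (norm spiking) gradients.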
---
## 7.7 Bias, Interpretability & Privacy in NLP
| Issue | Mitigation | Tool |
|-------|------------|------|
| **Data Bias** | Balanced sampling, fairness constraints | `fairlearn` library |
| **Interpretability** | Attention visualisation, LIME | `transformers` with `output_attentions=True` |
| **Privacy** | Differential privacy during training | `tensorflow-privacy` |
| **Token‑level leaks** | Redact PII before tokenisation | regex + `spacy` |
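The token-level-leak row can be sketched with plain regexes; the pattern names below are illustrative, and real pipelines pair this with an NER model (e.g. spaCy's) to catch names and addresses that regexes miss.

```python
import re

# Illustrative PII patterns — extend per jurisdiction and data source
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact(text):
    """Replace each PII match with a bracketed placeholder before tokenisation."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].
```

Redacting *before* tokenisation matters: once PII enters the training corpus, the model can memorise and regurgitate it.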
```python
# Visualising BERT self-attention, averaged over heads in the last layer
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

text = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(text, return_tensors='pt')
output = model(**inputs)

# Last layer's attention: shape [heads, seq_len, seq_len]
attentions = output.attentions[-1].detach().numpy()[0]
avg_attn = attentions.mean(axis=0)  # average across heads

# Tick labels must be the model's WordPiece tokens (including [CLS]/[SEP]),
# not whitespace-split words, or they will be misaligned with the matrix
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
plt.imshow(avg_attn, cmap='viridis')
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title('BERT Self‑Attention')
plt.show()
```
Interpretability is often a *post‑hoc* requirement in regulated industries. Visualising attention or extracting token‑level importance helps stakeholders trust the model.
---
## 7.8 Summary & Next Steps
| Concept | Key Insight |
|---------|-------------|
| **MLPs** | Effective for structured tabular data when combined with proper regularisation. |
| **CNNs** | Leverage local spatial patterns; transfer learning is a practical productivity booster. |
| **RNNs / LSTMs** | Essential for sequential data where order matters; gated units solve long‑range dependency problems. |
| **Transformers** | Parallel, context‑rich modeling that dominates NLP benchmarks; fine‑tuning is straightforward with `transformers`. |
| **Pipeline** | Tokenise → Embed → Contextualise (attention) → Classify / Generate → Deploy. |
**Action items for the reader**
1. **Implement** a BERT‑based classifier on a real dataset (e.g., product reviews). Tune learning rate and batch size.
2. **Experiment** with a lightweight CNN on a custom tabular dataset reshaped into 2‑D arrays.
3. **Deploy** the trained model as a REST service using TensorFlow‑Serving or FastAPI.
4. **Audit** the model for bias by inspecting per‑class performance across user demographics.
In the next chapter, we will translate these trained models into *production‑ready* services that can be integrated with Spark pipelines, orchestrated via Airflow, and monitored with Prometheus + Grafana.