Preprocessing Pipeline for Transformer Models with Attention Mechanisms
1. Text Tokenization
Concept
Convert raw text into numerical tokens that the model can process.
Steps
- Split text into words/subwords
- Map tokens to vocabulary indices
- Handle special tokens ([CLS], [SEP], [PAD])
Example
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "The cat sat on the mat."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Original:", text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)
Output:
Original: The cat sat on the mat.
Tokens: ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
Token IDs: [1996, 4937, 2038, 2006, 1996, 4812, 1012]
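Note that `tokenize` alone does not insert the special tokens from the Steps list. A minimal sketch of how `tokenizer.encode` adds them automatically (for bert-base-uncased, [CLS] has ID 101 and [SEP] has ID 102):
# encode() wraps the sequence with [CLS] ... [SEP] by default
ids_with_special = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids_with_special))
# ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']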
2. Sequence Padding/Truncation
Concept
Ensure all sequences have uniform length for batch processing.
Steps
- Pad shorter sequences with [PAD] tokens
- Truncate longer sequences to max length
Example
max_length = 10
padded_ids = token_ids + [tokenizer.pad_token_id] * (max_length - len(token_ids))
attention_mask = [1] * len(token_ids) + [0] * (max_length - len(token_ids))
print("Padded IDs:", padded_ids)
print("Attention Mask:", attention_mask)
Output:
Padded IDs: [1996, 4937, 2038, 2006, 1996, 4812, 1012, 0, 0, 0]
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Visualization
graph LR
A["Before Padding<br/>[1996, 4937, 2038, 2006, 1996, 4812, 1012]"] -->|Padding to length 10| B["After Padding<br/>[1996, 4937, 2038, 2006, 1996, 4812, 1012, 0, 0, 0]"]
B --> C["Attention Mask<br/>[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]"]
classDef data fill:#f9f9f9,stroke:#333,stroke-width:1px;
class A,B,C data;
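Padding handles short sequences; truncation is its mirror image for long ones. A small helper sketch that does both in one pass (`pad_or_truncate` is an illustrative name, not part of the tokenizers API):
def pad_or_truncate(ids, max_length, pad_id):
    if len(ids) >= max_length:
        return ids[:max_length]  # truncate long sequences
    return ids + [pad_id] * (max_length - len(ids))  # pad short ones

print(pad_or_truncate(token_ids, 5, tokenizer.pad_token_id))
# [1996, 4937, 2038, 2006, 1996]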
3. Positional Encoding
Concept
Add information about token positions, since self-attention is permutation-invariant and has no built-in notion of word order.
Formula
PE(pos,2i) = sin(pos/10000^(2i/d_model))
PE(pos,2i+1) = cos(pos/10000^(2i/d_model))
PyTorch Implementation
import torch
import math
def positional_encoding(max_len, d_model):
    # Position indices as a column vector: (max_len, 1)
    position = torch.arange(max_len).unsqueeze(1)
    # Frequency terms for each even dimension, per the formula above
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
    return pe
d_model = 512
pe = positional_encoding(max_length, d_model)
print("Positional Encoding Shape:", pe.shape)
Token Embeddings: [E1, E2, E3, ..., En]
Positional Encodings: [PE1, PE2, PE3, ..., PEn]
Final Input: [E1+PE1, E2+PE2, E3+PE3, ..., En+PEn]
Visualization
graph TD
A["Token Embeddings<br/>[E₁, E₂, E₃, ..., Eₙ]"] --> C["Final Input<br/>[E₁+PE₁, E₂+PE₂, E₃+PE₃, ..., Eₙ+PEₙ]"]
B["Positional Encodings<br/>[PE₁, PE₂, PE₃, ..., PEₙ]"] --> C
classDef embeddings fill:#dae8fc,stroke:#0066cc,stroke-width:2px
classDef positional fill:#fff2cc,stroke:#f1c232,stroke-width:2px
classDef final fill:#d5e8d4,stroke:#82b366,stroke-width:2px
class A embeddings
class B positional
class C final
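A short sketch of the addition shown above, using a randomly initialized embedding table as a stand-in for BERT's learned token embeddings (the real table is trained; this one only serves to check shapes):
import torch.nn as nn

embedding = nn.Embedding(tokenizer.vocab_size, d_model)   # stand-in embedding table
token_embeddings = embedding(torch.tensor([padded_ids]))  # (1, 10, 512)
final_input = token_embeddings + pe[:max_length]          # PE broadcasts over the batch
print("Final Input Shape:", final_input.shape)            # torch.Size([1, 10, 512])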
4. Attention Mask Creation
Concept
Tell the model which tokens to attend to (1=real token, 0=padding).
Example
import torch
attention_mask = torch.tensor([attention_mask]) # From Step 2
print("Attention Mask Tensor:")
print(attention_mask)
Output:
tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])
Visualization
graph LR
A["Tokens:</br>[The, cat, sat, on, the, mat, ., [PAD], [PAD], [PAD]]"] --> B
B["Mask:</br>[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]"]
5. Segment Embeddings (for BERT-style models)
Concept
Distinguish between multiple sequences (e.g., question/answer pairs).
Example
segment_ids = [0] * max_length # Single sequence
segment_ids = torch.tensor([segment_ids])
print("Segment IDs:", segment_ids)
Output:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
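For an actual pair of sequences, the tokenizer produces the 0/1 segmentation automatically. A quick sketch (the question/answer strings are invented for illustration):
pair = tokenizer("Where did the cat sit?", "The cat sat on the mat.", return_tensors='pt')
print(pair['token_type_ids'])  # 0s over the first sequence, 1s over the second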
6. Final Input Preparation
Combine all components into model-ready tensors:
inputs = {
    'input_ids': torch.tensor([padded_ids]),
    'attention_mask': attention_mask,  # tensor from Step 4
    'token_type_ids': segment_ids      # tensor from Step 5
}
7. Complete Pipeline with PyTorch
from transformers import BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def preprocess(text, max_length=10):
    # Tokenize, pad/truncate to max_length, and return PyTorch tensors
    encoding = tokenizer(
        text,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    # Positional encodings are added inside the model, not here
    return {
        'input_ids': encoding['input_ids'],
        'attention_mask': encoding['attention_mask'],
        'token_type_ids': encoding['token_type_ids']
    }
# Example usage
inputs = preprocess("The cat sat on the mat.")
print("Processed Inputs:")
print(inputs)
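Because the tokenizer also accepts a list of strings, the same function batches without modification. A quick check:
batch = preprocess(["The cat sat on the mat.", "Dogs bark."])
print(batch['input_ids'].shape)  # torch.Size([2, 10])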
8. Attention Mechanism Implementation
Scaled Dot-Product Attention
def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Scale dot products by sqrt(d_k) to keep the softmax in a stable range
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # Push masked positions to -1e9 so softmax gives them ~zero weight
        scores = scores.masked_fill(mask == 0, -1e9)
    attention = torch.softmax(scores, dim=-1)
    output = torch.matmul(attention, V)
    return output
# Example usage
d_model = 512
seq_len = 10
Q = torch.randn(1, 8, seq_len, d_model//8) # (batch, heads, seq_len, head_dim)
K = torch.randn(1, 8, seq_len, d_model//8)
V = torch.randn(1, 8, seq_len, d_model//8)
attention_output = scaled_dot_product_attention(Q, K, V)
print("Attention Output Shape:", attention_output.shape)
9. Complete Transformer Block
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        # batch_first=True so inputs are (batch, seq_len, d_model)
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask):
        # key_padding_mask expects True at padded positions, so invert the 1/0 mask
        attn_output, _ = self.attention(x, x, x, key_padding_mask=(mask == 0))
        x = self.norm1(x + attn_output)
        ff_output = self.ff(x)
        return self.norm2(x + ff_output)
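A quick smoke test of the block with the padding mask from Step 2 (the weights are randomly initialized, so only the shapes are meaningful):
block = TransformerBlock(d_model=512, n_heads=8)
x = torch.randn(1, 10, 512)  # (batch, seq_len, d_model)
mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])
out = block(x, mask)
print("Block Output Shape:", out.shape)  # torch.Size([1, 10, 512])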
Visual Summary of Full Pipeline
graph TD
A[Raw Text] --> B[Tokenization]
B --> C[Token IDs + Special Tokens]
C --> D[Padding/Truncation]
D --> E[Input IDs + Attention Mask + Segment IDs]
E --> F[Token Embeddings + Positional Encoding]
F --> G[Transformer Encoder Layers]
G --> H[Multi-Head Attention]
H --> I[Output Representations]
This pipeline shows how raw text is transformed, step by step, into the numerical representations that Transformer models use to compute attention and make predictions.