Preprocessing Pipeline for Transformer Models with Attention Mechanisms

1. Text Tokenization

Concept

Convert raw text into numerical tokens that the model can process.

Steps

  1. Split text into words/subwords
  2. Map tokens to vocabulary indices
  3. Handle special tokens ([CLS], [SEP], [PAD])

Example

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "The cat sat on the mat."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Original:", text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)

Output:

Original: The cat sat on the mat.
Tokens: ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
Token IDs: [1996, 4937, 2038, 2006, 1996, 4812, 1012]
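
The tokenize/convert pair above leaves out the special tokens listed in the Steps. In practice, encode (or calling the tokenizer directly) adds them for you; a minimal sketch using the same tokenizer:

ids_with_special = tokenizer.encode(text)  # adds [CLS] and [SEP] automatically
print("With Special Tokens:", tokenizer.convert_ids_to_tokens(ids_with_special))
# ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']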


2. Sequence Padding/Truncation

Concept

Ensure all sequences have uniform length for batch processing.

Steps

  1. Pad shorter sequences with [PAD] tokens
  2. Truncate longer sequences to max length

Example

max_length = 10
padded_ids = token_ids + [tokenizer.pad_token_id] * (max_length - len(token_ids))
attention_mask = [1] * len(token_ids) + [0] * (max_length - len(token_ids))

print("Padded IDs:", padded_ids)
print("Attention Mask:", attention_mask)

Output:

Padded IDs: [1996, 4937, 2038, 2006, 1996, 4812, 1012, 0, 0, 0]
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
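
The Steps also call for truncating sequences that exceed max_length; a minimal sketch of that mirror case (long_ids is an artificially over-length sequence made up for illustration):

long_ids = token_ids * 3  # pretend this sequence is too long
truncated_ids = long_ids[:max_length]  # keep only the first max_length tokens
print("Truncated IDs:", truncated_ids)
print("Truncated Length:", len(truncated_ids))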

Visualization

graph LR
    A["Before Padding<br/>[1996, 4937, 2038, 2006, 1996, 4812, 1012]"] -->|Padding to length 10| B["After Padding<br/>[1996, 4937, 2038, 2006, 1996, 4812, 1012, 0, 0, 0]"]
    B --> C["Attention Mask<br/>[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]"]

    classDef data fill:#f9ffff,stroke:#333,stroke-width:1px;
    class A,B,C data;

3. Positional Encoding

Concept

Inject information about token positions, since self-attention on its own has no notion of sequence order.

Formula

PE(pos,2i) = sin(pos/10000^(2i/d_model))
PE(pos,2i+1) = cos(pos/10000^(2i/d_model))

PyTorch Implementation

import torch
import math

def positional_encoding(max_len, d_model):
    position = torch.arange(max_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

d_model = 512
pe = positional_encoding(max_length, d_model)
print("Positional Encoding Shape:", pe.shape)

Example:
Token Embeddings: [E1, E2, E3, ..., En]
Positional Encodings: [PE1, PE2, PE3, ..., PEn]
Final Input: [E1+PE1, E2+PE2, E3+PE3, ..., En+PEn]
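
To make the element-wise sum concrete, here is a minimal sketch that adds the encodings from positional_encoding above to random placeholder embeddings (a real model would use learned token embeddings instead):

token_embeddings = torch.randn(max_length, d_model)  # placeholder, not real BERT embeddings
final_input = token_embeddings + pe  # element-wise sum, shape (max_length, d_model)
print("Final Input Shape:", final_input.shape)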

Visualization

graph TD
    A["Token Embeddings<br/>[E₁, E₂, E₃, ..., Eₙ]"] --> C["Final Input<br/>[E₁+PE₁, E₂+PE₂, E₃+PE₃, ..., Eₙ+PEₙ]"]
    B["Positional Encodings<br/>[PE₁, PE₂, PE₃, ..., PEₙ]"] --> C

    classDef embeddings fill:#dae8fc,stroke:#0066cc,stroke-width:2px
    classDef positional fill:#fff2cc,stroke:#f1c232,stroke-width:2px
    classDef final fill:#d5e8d4,stroke:#82b366,stroke-width:2px

    class A embeddings
    class B positional
    class C final

4. Attention Mask Creation

Concept

Tell the model which tokens to attend to (1=real token, 0=padding).

Example

import torch

attention_mask = torch.tensor([attention_mask])  # From Step 2
print("Attention Mask Tensor:")
print(attention_mask)

Output:

tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])

Visualization

graph LR
    A["Tokens:</br>[The, cat, sat, on, the, mat, ., [PAD], [PAD], [PAD]]"] --> B
    B["Mask:</br>[1,   1,   1,   1,   1,   1,  1,   0,     0,     0]"]

5. Segment Embeddings (for BERT-style models)

Concept

Distinguish between multiple sequences (e.g., question/answer pairs).

Example

segment_ids = [0] * max_length  # Single sequence
segment_ids = torch.tensor([segment_ids])
print("Segment IDs:", segment_ids)

Output:

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
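
For an actual sequence pair, the tokenizer builds the segment IDs itself. A minimal sketch with the same bert-base-uncased tokenizer (the question/answer strings are made up for illustration):

pair = tokenizer("Where is the cat?", "On the mat.", return_tensors='pt')
print("Token Type IDs:", pair['token_type_ids'])  # 0s for the first sentence, 1s for the second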


6. Final Input Preparation

Combine all components into model-ready tensors:

inputs = {
    'input_ids': torch.tensor([padded_ids]),
    'attention_mask': attention_mask,
    'token_type_ids': segment_ids
}
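
These tensors can be fed straight into a BERT-style model; a minimal sketch (this downloads the bert-base-uncased weights on first use):

from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
outputs = model(**inputs)  # dict keys match the model's argument names
print("Last Hidden State Shape:", outputs.last_hidden_state.shape)  # (1, 10, 768)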

7. Complete Pipeline with PyTorch

from transformers import BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def preprocess(text, max_length=10):
    # Tokenization
    encoding = tokenizer(
        text,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    # Positional encodings are added inside the model itself, not here
    return {
        'input_ids': encoding['input_ids'],
        'attention_mask': encoding['attention_mask'],
        'token_type_ids': encoding['token_type_ids']
    }

# Example usage
inputs = preprocess("The cat sat on the mat.")
print("Processed Inputs:")
print(inputs)

8. Attention Mechanism Implementation

Scaled Dot-Product Attention

import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Scale by sqrt(d_k) so the dot products stay in a numerically stable range
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    attention = torch.softmax(scores, dim=-1)
    output = torch.matmul(attention, V)
    return output

# Example usage
d_model = 512
seq_len = 10
Q = torch.randn(1, 8, seq_len, d_model//8)  # (batch, heads, seq_len, head_dim)
K = torch.randn(1, 8, seq_len, d_model//8)
V = torch.randn(1, 8, seq_len, d_model//8)

attention_output = scaled_dot_product_attention(Q, K, V)
print("Attention Output Shape:", attention_output.shape)

9. Complete Transformer Block

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attention_mask=None):
        # nn.MultiheadAttention ignores positions where key_padding_mask is True,
        # so invert the 1=real / 0=padding convention used in Step 4
        key_padding_mask = (attention_mask == 0) if attention_mask is not None else None
        attn_output, _ = self.attention(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + attn_output)
        ff_output = self.ff(x)
        return self.norm2(x + ff_output)
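
A quick smoke test of the block, reusing the attention mask tensor from Step 4 (batch-first shapes, matching batch_first=True above):

block = TransformerBlock(d_model=512, n_heads=8)
x = torch.randn(1, 10, 512)  # (batch, seq_len, d_model)
out = block(x, attention_mask=attention_mask)
print("Block Output Shape:", out.shape)  # torch.Size([1, 10, 512])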

Visual Summary of Full Pipeline

graph TD
    A[Raw Text] --> B[Tokenization]
    B --> C[Token IDs + Special Tokens]
    C --> D[Padding/Truncation]
    D --> E[Input IDs + Attention Mask + Segment IDs]
    E --> F[Token Embeddings + Positional Encoding]
    F --> G[Transformer Encoder Layers]
    G --> H[Multi-Head Attention]
    H --> I[Output Representations]

This complete pipeline shows how raw text is transformed into the numerical representations a Transformer uses to compute attention and make predictions.