Life of a Sequence

Tracing how a sequence moves through a Large Language Model.

May 10, 2025

Motivation

Understanding transformers from the inside out is essential for AI engineers in 2025. When I joined Google’s search team in 2019, I found an invaluable resource: “Life of a Query.” This guide traced a search request through each system stage, showing how the complex machinery worked as a whole rather than isolated parts.

This approach was transformative. By following a query’s journey, I could pinpoint performance bottlenecks and their solutions. Despite search’s complexity, this mental model made it accessible.

Transformers demand equally deep understanding. While excellent resources exist - like The Annotated Transformer by Sasha Rush, The Illustrated Transformer by Jay Alammar, and Andrej Karpathy’s Let’s build GPT: from scratch, in code, spelled out - I found no comprehensive “Life of a Sequence” guide.

This is my attempt to fill that gap - a complete journey through every layer of a language model. Created with language models (poetic, isn’t it?), this guide aims to make transformer architecture transparent. If you spot any errors, please let me know.

A Colab version is available here.

Preliminaries

We’ll trace two sequences through a transformer to understand how batch processing works. We’ll track these dimensions throughout:

b: batch_size
t: sequence_length
vv: vocab_size
d: d_model (embedding dimension)

Our example sentences:

sentence1 = ' “  in   1985 ,   Miuccia   prada  unveiled   the Nylon BACKPACK that  transformed  Luxury fashion. ” '

sentence2 = " photosynthesis is the process by which green plants convert sunlight into chemical energy. "

0. Text Normalization (Optional)

Modern language models often handle text normalization internally, but it’s important to understand what happens under the hood. Here are common issues and their solutions:

Issue Type           | Example in Sentence          | How to Handle It
Inconsistent spacing | “  in   1985 ,   Miuccia”    | Collapse multiple spaces and replace with a marker (▁)
Unicode punctuation  | Curly quotes “ ” instead of " | Normalize using Unicode NFKC or a custom ruleset

Here’s how we can normalize text manually:

import unicodedata
import re

def normalize(text: str) -> str:
    # 1. Apply Unicode normalization (NFKC handles most punctuation + ligatures)
    text = unicodedata.normalize("NFKC", text)

    # 2. Replace curly quotes with ASCII quotes (optional, explicit)
    text = text.replace("“", "\"").replace("”", "\"")

    # 3. Collapse multiple spaces into a single space
    text = re.sub(r"\s+", " ", text.strip())

    # 4. Replace spaces with the special marker ▁
    text = "▁" + text.replace(" ", "▁")

    return text

print(normalize(sentence1))
print(normalize(sentence2))

# ▁"▁in▁1985▁,▁Miuccia▁prada▁unveiled▁the▁Nylon▁BACKPACK▁that▁transformed▁Luxury▁fashion.▁"
# ▁photosynthesis▁is▁the▁process▁by▁which▁green▁plants▁convert▁sunlight▁into▁chemical▁energy.

In modern systems, this normalization step is usually handled automatically by the tokenizer.

1. Tokenization

Tokenization is the critical first step where raw text is converted into a format the model can process. This transformation maps text into a sequence of integer token IDs that correspond to entries in the model’s vocabulary.

Why Tokenization Matters

Simple tokenization methods like splitting text by spaces or using a fixed vocabulary have significant limitations:

  • They can’t handle unseen words (the out-of-vocabulary problem)
  • They struggle with morphologically rich languages
  • They don’t capture subword information efficiently

Modern language models use subword tokenization algorithms that balance vocabulary size with the ability to represent any text. The most widely used in state-of-the-art models is Byte-Pair Encoding (BPE).

SentencePiece with BPE

SentencePiece is a popular tokenization library that implements BPE among other algorithms. BPE works by:

  1. Starting with characters as the basic units
  2. Iteratively merging the most frequent adjacent pairs to form new tokens
  3. Building a vocabulary of subword units based on frequency statistics

This approach can handle unseen words by decomposing them into known subword units, making it robust across languages and domains.
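To make the merge procedure concrete, here is a tiny, illustrative merge loop in plain Python; it sketches the core idea and is not SentencePiece's actual implementation:

from collections import Counter

def bpe_merges(corpus, num_merges):
    # Start from characters: each word is a tuple of single-character symbols
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["low", "low", "lower", "lowest"], num_merges=3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]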

Let’s see it in action with our example sentences:

import sentencepiece as spm

# Create a tiny corpus from our example sentences
with open("toy_corpus.txt", "w") as f:
    f.write(sentence1 + "\n")
    f.write(sentence2 + "\n")

vv = 100

# Train a BPE tokenizer model
# - vocab_size=100: Limits vocabulary to 100 tokens
# - model_type='bpe': Uses Byte-Pair Encoding algorithm
spm.SentencePieceTrainer.train(
    input='toy_corpus.txt',
    model_prefix='bpe_demo',
    vocab_size=vv,
    model_type='bpe'
)

# Load the trained tokenizer
sp = spm.SentencePieceProcessor()
sp.load('bpe_demo.model')

# Convert text to token IDs
ids1 = sp.encode(sentence1, out_type=int)
ids2 = sp.encode(sentence2, out_type=int)

print("Sentence 1 tokens:", sp.encode(sentence1, out_type=str))
print("Sentence 1 ids:", ids1)
print("Sentence 2 tokens:", sp.encode(sentence2, out_type=str))
print("Sentence 2 ids:", ids2)

# Sentence 1 tokens: ['▁“', '▁i', 'n', '▁', '19', '85', '▁,', '▁', 'Miu', 'cc', 'ia', '▁p', 'ra', 'da', '▁', 'un', 'v', 'eil', 'ed', '▁the', '▁', 'Nyl', 'on', '▁B', 'ACK', 'P', 'ACK', '▁t', 'h', 'at', '▁t', 'ra', 'ns', 'fo', 'rm', 'ed', '▁', 'Lu', 'xu', 'ry', '▁f', 'as', 'hi', 'on', '.', '▁”']
# Sentence 1 ids: [55, 8, 63, 61, 19, 20, 50, 61, 57, 28, 38, 4, 14, 31, 61, 15, 84, 60, 10, 18, 61, 58, 7, 51, 17, 94, 17, 5, 66, 26, 5, 14, 43, 33, 45, 10, 61, 21, 49, 47, 52, 25, 13, 7, 78, 56]
# Sentence 2 tokens: ['▁p', 'ho', 't', 'os', 'y', 'nt', 'he', 's', 'is', '▁i', 's', '▁the', '▁p', 'ro', 'ce', 'ss', '▁', 'by', '▁w', 'hi', 'ch', '▁', 'gr', 'een', '▁p', 'la', 'nt', 's', '▁c', 'on', 'v', 'er', 't', '▁s', 'un', 'li', 'gh', 't', '▁i', 'nt', 'o', '▁c', 'he', 'm', 'ic', 'al', '▁', 'en', 'er', 'gy', '.']
# Sentence 2 ids: [4, 37, 65, 44, 74, 6, 3, 67, 40, 8, 67, 18, 4, 46, 29, 48, 61, 27, 54, 13, 30, 61, 35, 59, 4, 41, 6, 67, 16, 7, 84, 12, 65, 53, 15, 42, 34, 65, 8, 6, 69, 16, 3, 83, 39, 24, 61, 11, 12, 36, 78]

Preparing for Batch Processing

To process multiple sequences simultaneously, we need to handle varying sequence lengths. The standard approach is to pad shorter sequences to match the longest one in the batch:

import torch

# Batch size and the longest sequence length in the batch
b = 2
t = max(len(ids1), len(ids2))

# Pad the shorter sequence with ID 0 (SentencePiece reserves 0 for <unk> by default;
# production models typically use a dedicated <pad> token)
ids1_padded = ids1 + [0] * (t - len(ids1))
ids2_padded = ids2 + [0] * (t - len(ids2))

# Create a tensor for batch processing
input_ids = torch.tensor([ids1_padded, ids2_padded])  # shape: (b, t)
b, t = input_ids.shape

print("input_ids.shape:", input_ids.shape)
print(f"Our dimensions are now: batch_size (b)={b}, sequence_length (t)={t}")

# input_ids.shape: torch.Size([2, 51])
# Our dimensions are now: batch_size (b)=2, sequence_length (t)=51
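Real pipelines also build a padding mask alongside input_ids so that attention and the loss can ignore padded positions. This walkthrough skips masking (see step 4), but a minimal sketch, assuming ID 0 appears only as padding (true for our toy batch), looks like this:

pad_id = 0                                     # the ID we used for padding above
attention_mask = (input_ids != pad_id).long()  # (b, t): 1 for real tokens, 0 for padding
print(attention_mask.sum(dim=1))               # real tokens per sequence, e.g. tensor([46, 51])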

At this point, our sequence data is ready for the embedding layer in the next step. The tensor’s shape (b, t) = (2, 51) means we have 2 sequences, each containing 51 tokens (including padding).

2. Embedding Lookup

After tokenization, we have integer IDs representing tokens, but neural networks operate on continuous vector spaces. The embedding layer transforms these discrete token IDs into dense vector representations that capture semantic relationships between tokens.

For our example, we’ll use randomly initialized embeddings for simplicity. In practice, these embeddings are learned during training or initialized from pretrained models:

import torch
import torch.nn as nn

d = 64   # Embedding dimension, typically much larger in real models

# Create embedding layer with random initialization
embedding_layer = nn.Embedding(num_embeddings=vv, embedding_dim=d)

# Apply embedding transformation
embedded = embedding_layer(input_ids)

print(f"input_ids shape: {input_ids.shape}")
print(f"embedded shape: {embedded.shape}")

# input_ids shape: torch.Size([2, 51])
# embedded shape: torch.Size([2, 51, 64])

The dimensions change as follows:

Input: [b, t] = [2, 51]
     ↓ Embedding Lookup
Output: [b, t, d] = [2, 51, 64]
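Under the hood, the lookup is just row selection from the embedding weight matrix of shape (vv, d); the equivalence is easy to check:

# nn.Embedding is a lookup table: row i of the weight matrix is the vector for token ID i
manual = embedding_layer.weight[input_ids]  # fancy indexing: (b, t) IDs -> (b, t, d) vectors
print(torch.allclose(manual, embedded))     # True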

3. Positional Encoding

Self-attention is permutation-invariant, so without position information the model would treat each sentence as a bag of tokens. Positional encoding injects each token's position in the sequence into its representation.
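The code below implements the sinusoidal scheme from Attention Is All You Need:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))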

import math

# Sinusoidal Positional Encoding
pos = torch.arange(t, dtype=torch.float).unsqueeze(1)    # (t, 1) token positions
two_i = torch.arange(0, d, 2).float()                    # even embedding indices 2i
div_term = torch.exp(two_i * (-math.log(10000.0) / d))   # 1 / 10000^(2i/d)

pe = torch.zeros(t, d)
pe[:, 0::2] = torch.sin(pos * div_term)   # even dimensions get sine
pe[:, 1::2] = torch.cos(pos * div_term)   # odd dimensions get cosine

pe = pe.unsqueeze(0).expand(b, t, d)  # (b, t, d)
x = embedded + pe  # final input embedding
print("x.shape:", x.shape)

# x.shape: torch.Size([2, 51, 64])

This step doesn’t change the tensor’s shape. For a deep dive on sinusoidal positional encodings, see this visualizer.

4. Scaled Dot-Product Self-Attention

We first project x into queries (Q), keys (K), and values (V) using learned linear layers (W_o will later project the attended context back to the model dimension).

W_q = nn.Linear(d, d, bias=False)
W_k = nn.Linear(d, d, bias=False)
W_v = nn.Linear(d, d, bias=False)
W_o = nn.Linear(d, d, bias=False)

q = W_q(x)  # (b, t, d)
k = W_k(x)  # (b, t, d)
v = W_v(x)  # (b, t, d)

Each token now has:

  • A query vector: What am I looking for?
  • A key vector: What do I contain?
  • A value vector: What info do I pass on?
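In matrix form, the attention sublayer computes

Attention(Q, K, V) = softmax(Q · K^T / sqrt(d)) · V

where the softmax normalizes each row of the (t, t) score matrix, so every output token is a weighted average of the value vectors.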

Then we compute scaled dot-product attention:

attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # (b, t, t)
attn_weights = torch.softmax(attn_scores, dim=-1)                  # (b, t, t)

context = torch.matmul(attn_weights, v)  # (b, t, d)
x_attn = W_o(context)                    # (b, t, d)

print("x_attn.shape:", x_attn.shape)

# x_attn.shape: torch.Size([2, 51, 64])

For simplicity we skip causal masking, dropout, and multi-head attention here; a follow-up post may cover them.

The scores are “scaled” (divided by math.sqrt(d)) because dot products grow with the dimension; without the scaling, the softmax saturates and the attention weights become overly spiky.

5. Feed-Forward Networks

After self-attention, each token’s representation is processed through a feed-forward network (FFN). This network consists of two linear transformations with a non-linear activation function in between. The FFN helps the model learn complex patterns and relationships that might be difficult to capture through attention alone.

Here’s how the FFN works:

  1. First, we add the attention output to the original input (residual connection)
  2. Apply layer normalization to stabilize the values
  3. Pass through the FFN with an expansion factor (typically 4x)
  4. Add the FFN output back to the input (another residual connection)

# 1. Residual connection: Add attention output to original input
x = x + x_attn  # Shape: (b, t, d)

# 2. Layer normalization
ln = nn.LayerNorm(d, elementwise_affine=False)
x = ln(x)  # Shape: (b, t, d)

# 3. Feed-forward network
# The FFN typically expands the dimension by 4x before contracting back
ffn = nn.Sequential(
    nn.Linear(d, d * 4),  # Expand: d → 4d
    nn.GELU(),            # Non-linear activation
    nn.Linear(d * 4, d)   # Contract: 4d → d
)

ffn_out = ffn(x)  # Shape: (b, t, d)

# 4. Final residual connection
x = x + ffn_out  # Shape: (b, t, d)

The dimensions remain unchanged throughout this process: (b, t, d). This block (attention + FFN) can be repeated multiple times in a transformer, with each repetition called a “layer” or “block”.
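To see how depth comes in, here is one way to package steps 4 and 5 into a reusable block and stack a few copies; this is an illustrative sketch under the same simplifications (single-head attention, no masking, no dropout), not a production implementation:

class TransformerBlock(nn.Module):
    """One attention + FFN block, mirroring steps 4 and 5."""
    def __init__(self, d):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_o = nn.Linear(d, d, bias=False)
        self.ln = nn.LayerNorm(d, elementwise_affine=False)
        self.ffn = nn.Sequential(nn.Linear(d, d * 4), nn.GELU(), nn.Linear(d * 4, d))

    def forward(self, x):
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        x = self.ln(x + self.W_o(attn @ v))   # attention sublayer + residual + LayerNorm
        return x + self.ffn(x)                # FFN sublayer + residual

# Stack four blocks; d and x are the values defined earlier in this walkthrough
blocks = nn.Sequential(*[TransformerBlock(d) for _ in range(4)])
print(blocks(x).shape)  # torch.Size([2, 51, 64])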

6. Output Projection and Decoding

The final step is to convert the transformer’s hidden states into probabilities over the vocabulary, which allows the model to predict the next token in the sequence.

# Project hidden states to vocabulary size
output_proj = nn.Linear(d, vv, bias=False)  # d → vv dimensions
logits = output_proj(x)  # Shape: (b, t, vv)

# Convert logits to probabilities using softmax
# This gives us a probability distribution over the vocabulary for each position
probs = torch.softmax(logits, dim=-1)  # Shape: (b, t, vv)

# For autoregressive generation:
# 1. Take the last token's probabilities
last_token_probs = probs[:, -1, :]  # Shape: (b, vv)

# 2. Sample or take argmax to get the next token
next_token = torch.argmax(last_token_probs, dim=-1)  # Shape: (b,)

# e.g., tensor([80,  0])

The output probabilities represent the model’s confidence in each possible next token. During training, these probabilities are compared against the actual next tokens to compute the loss. During generation, we can either:

  • Take the most likely token (argmax)
  • Sample from the distribution (more creative, but potentially less coherent)
  • Use more sophisticated sampling methods like top-k or nucleus sampling
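The last option is sketched below; the temperature and k values are arbitrary, purely for illustration:

# Top-k sampling: sample only from the k most likely tokens
temperature, top_k = 0.8, 10
scaled = logits[:, -1, :] / temperature          # (b, vv) last-position logits, softened by temperature
topk_vals, topk_idx = torch.topk(scaled, top_k)  # keep the top_k largest logits per sequence
topk_probs = torch.softmax(topk_vals, dim=-1)    # renormalize over those top_k tokens
sampled = topk_idx.gather(-1, torch.multinomial(topk_probs, 1))  # (b, 1) sampled token IDs

Nucleus (top-p) sampling works the same way, except the cutoff is chosen by cumulative probability mass rather than a fixed k.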

Here is the argmax version:

print("\nPredicted Next Tokens:")
for i, token_id in enumerate(next_token):
    token = sp.id_to_piece(token_id.item())
    print(f"Sequence {i+1} next token: ID {token_id.item():3d} -> '{token}'")

# Predicted Next Tokens:
# Sequence 1 next token: ID  80 -> 'C'
# Sequence 2 next token: ID   0 -> '<unk>'
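Putting everything together, here is a toy greedy decoding loop that reuses the layers defined above (embedding, positional encoding, the single attention + FFN block, and the output projection). Since every weight is random, the continuations are gibberish; the point is only to show the autoregressive mechanics:

def toy_forward(ids):
    # Embed token IDs and add sinusoidal positional encodings for the current length
    t_ = ids.size(1)
    h = embedding_layer(ids)                                           # (b, t_, d)
    pos_ = torch.arange(t_, dtype=torch.float).unsqueeze(1)
    div_ = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
    pe_ = torch.zeros(t_, d)
    pe_[:, 0::2] = torch.sin(pos_ * div_)
    pe_[:, 1::2] = torch.cos(pos_ * div_)
    h = h + pe_                                                        # broadcast over the batch
    # One attention + FFN block, reusing the layers from steps 4 and 5
    q_, k_, v_ = W_q(h), W_k(h), W_v(h)
    attn = torch.softmax(q_ @ k_.transpose(-2, -1) / math.sqrt(d), dim=-1)
    h = ln(h + W_o(attn @ v_))
    h = h + ffn(h)
    return output_proj(h)                                              # (b, t_, vv)

ids = input_ids
for _ in range(5):                                                     # generate 5 new tokens greedily
    next_id = toy_forward(ids)[:, -1, :].argmax(dim=-1, keepdim=True)  # (b, 1)
    ids = torch.cat([ids, next_id], dim=1)

print(sp.decode(ids[0].tolist()))  # gibberish: the model is untrained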

This completes the journey of a sequence through a transformer model, from raw text to token predictions.

References

  1. Attention Is All You Need
  2. The Annotated Transformer
  3. The Illustrated Transformer
  4. Let’s build GPT: from scratch, in code, spelled out
  5. SentencePiece