I Learned Transformers by Arguing With an AI. Here's the Transcript.
Learning Transformers & Attention: A Conversation
This is a conversation between Claude and me where we learn about transformers and how attention mechanisms work, step by step.
Starting Point
Claude: What do you already know about transformers? Not looking for a textbook answer - just tell me what you think they are, what they do, or any fragments you’ve picked up.
Me: I know that in 2017 some researchers at Google wrote the paper introducing transformers, and they have become the bedrock of LLMs, especially generative AI. It’s basically a predictive model that can predict large portions of text based on what it has been trained on.
Claude: Good foundation. You’ve got the “what” - now let’s get the “how.”
The Problem Transformers Solved
Before transformers, we had RNNs (Recurrent Neural Networks). They processed text one word at a time, in order:
"The cat sat" → process "The" → process "cat" → process "sat"
Two problems:
Slow - can’t parallelize, must go in sequence
Forgets - by the time you reach word 100, you’ve forgotten word 1
The transformer insight: What if we process all words at once, and let each word “look at” every other word to decide what’s important?
That’s attention.
Understanding Attention
Claude: In the sentence “The cat sat on the mat because it was tired” - what does “it” refer to? How did you figure that out?
Me: It refers to cat. I figured this out because I read the sentence as a whole.
Claude: Exactly. You didn’t read word-by-word forgetting what came before. You saw the whole sentence and connected “it” back to “cat” because that’s what makes sense.
That’s what attention does for the model.
For each word, the model asks: “Which other words in this sentence should I pay attention to when understanding this word?”
For “it” in that sentence:
“it” looks at “cat” → high attention (makes sense as referent)
“it” looks at “mat” → some attention (also a noun, possible referent)
“it” looks at “tired” → high attention (tells us “it” is something that can be tired - a living thing, so probably “cat” not “mat”)
The model learns these attention patterns from training data. It sees millions of sentences where “it” refers back to animals, people, or objects - and learns the patterns.
The Attention Formula (Simplified)
Every word gets three vectors:
Query (Q): “What am I looking for?”
Key (K): “What do I contain?”
Value (V): “What information do I give?”
Attention = match queries against keys, use the matches to weight the values.
Think of it like a search engine inside the model. Each word searches for relevant other words.
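That formula can be sketched in miniature. This is a toy version in plain Python, not how a real model computes it: real transformers learn the Q/K/V projections and run everything as tensor math, while here the queries, keys, and values are hand-written lists of numbers.

```python
import math

def softmax(scores):
    """Turn raw match scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q·K / sqrt(d)) applied to V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # How well does this query match each key?
        weights = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                           for k in keys])
        # Blend the values by those weights.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# A query that matches the first key pulls mostly from the first value:
out = attention(queries=[[1.0, 0.0]],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[10.0, 0.0], [0.0, 10.0]])
```

The "search engine" analogy is literal here: the query is the search, the keys are the index, and the softmax weights decide how much of each value makes it into the result.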
Self-Attention vs Cross-Attention
Claude: Why do you think they call it “self-attention” specifically?
Me: Because it can learn to pay attention on its own.
Claude: Close, but not quite.
It’s “self” because the sentence attends to itself. The queries, keys, and values all come from the same input sequence.
“The cat sat on the mat” → every word looks at every other word in the same sentence.
This is different from cross-attention, where one sequence attends to a different sequence. For example:
Translation: English sentence attends to French sentence
Image captioning: Text attends to image features
Self-attention = looking within
Cross-attention = looking at something else
Multi-Head Attention
The model doesn’t just run attention once. It runs it multiple times in parallel (typically 8-96 “heads”).
Why? Each head can learn to focus on different things:
Head 1: syntax (subject-verb relationships)
Head 2: coreference (“it” → “cat”)
Head 3: adjective-noun pairs
Head 4: position/distance
Then the results get combined.
Claude: Given what you know about embeddings - each word becomes a vector of numbers - why do you think we need multiple heads instead of just one really good attention mechanism?
Me: Because words can mean different things in different contexts, so we need the multiple heads to figure it out correctly.
Claude: Exactly right.
“Bank” needs different attention patterns depending on context:
“river bank” → Head focusing on nature/geography lights up
“bank account” → Head focusing on finance lights up
One head can’t capture all the different types of relationships. Multiple heads let the model look at the same sentence through different lenses simultaneously.
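Structurally, multi-head attention is just the same attention run several times with different projections, then concatenated. A minimal sketch, assuming tiny hand-set projection matrices in place of the learned weights a real model would have:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def project(W, vec):
    """Multiply a small matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in W]

def one_head(tokens, Wq, Wk, Wv):
    """A single head: project tokens into Q/K/V, then attend."""
    Q = [project(Wq, t) for t in tokens]
    K = [project(Wk, t) for t in tokens]
    V = [project(Wv, t) for t in tokens]
    d = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head(tokens, heads):
    """Run every head on the same tokens, concatenate each token's outputs."""
    per_head = [one_head(tokens, *W) for W in heads]
    return [[x for h in per_head for x in h[i]] for i in range(len(tokens))]
```

Because each head gets its own Wq/Wk/Wv, each one can learn a different "lens" - during training, those matrices drift apart so that one head ends up tracking syntax, another coreference, and so on.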
The Transformer Architecture
A transformer block has:
Multi-head self-attention - “which words matter for each word?”
Feed-forward network - processes each position independently
Layer normalization + residual connections - keeps training stable
Stack 12-96 of these blocks, and you get GPT/BERT/Claude.
Input embeddings
↓
┌─────────────────┐
│ Self-Attention │
├─────────────────┤
│ Feed-Forward │
└─────────────────┘
↓
(repeat 12-96x)
↓
Output
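The block diagram above can be sketched as code. This is a toy version: the attention and feed-forward pieces are passed in as stand-in functions, and the layer norm works on plain Python lists rather than tensors.

```python
import math

def layer_norm(v, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (keeps training stable)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def transformer_block(x, attend, ffn):
    """x is a list of token vectors. Each sublayer is wrapped in a
    residual connection (add the input back in) plus layer norm."""
    # 1. Self-attention: mixes information across tokens.
    a = attend(x)
    x = [layer_norm([xi + ai for xi, ai in zip(t, ta)]) for t, ta in zip(x, a)]
    # 2. Feed-forward: transforms each position independently.
    x = [layer_norm([xi + fi for xi, fi in zip(t, ffn(t))]) for t in x]
    return x

def model(x, n_blocks, attend, ffn):
    """Stack the same block shape n times (12-96 in real models)."""
    for _ in range(n_blocks):
        x = transformer_block(x, attend, ffn)
    return x
```

The residual connections are why such deep stacks train at all: each block only has to learn a correction on top of its input, not recompute everything from scratch.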
GPT vs BERT
Claude: You mentioned transformers are “predictive” - they predict the next word. But that’s only true for GPT-style models. BERT does something different. Based on the name (Bidirectional Encoder Representations from Transformers), what do you think BERT does differently?
Me: No clue.
Claude: Fair enough.
GPT (Generative Pre-trained Transformer):
Reads left-to-right only
Predicts the next word
“The cat sat on the ___” → predicts “mat”
Good for: generation, chatbots, writing
BERT (Bidirectional Encoder Representations from Transformers):
Reads in both directions simultaneously
Predicts masked words in the middle
“The cat [MASK] on the mat” → predicts “sat”
Good for: understanding, classification, search
The “bidirectional” part means BERT sees the whole sentence at once when predicting. GPT only sees what came before.
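The difference shows up in how training examples are built. A sketch, with the caveat that real tokenizers split text into subwords rather than whole words, and BERT masks ~15% of tokens at random rather than one chosen position:

```python
def gpt_training_pairs(tokens):
    """Next-token prediction: the context is only what came before."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def bert_training_pair(tokens, mask_index):
    """Masked prediction: the context is the whole sentence minus one word."""
    masked = tokens[:mask_index] + ["[MASK]"] + tokens[mask_index + 1:]
    return masked, tokens[mask_index]

sentence = ["The", "cat", "sat", "on", "the", "mat"]
# GPT trains on: (["The"], "cat"), (["The", "cat"], "sat"), ...
# BERT trains on: (["The", "cat", "[MASK]", "on", "the", "mat"], "sat")
```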
Why this matters practically:
Need to generate text? → GPT-style (Claude, ChatGPT)
Need to understand/classify text? → BERT-style (search rankings, sentiment analysis)
Need embeddings for RAG? → Usually BERT-style (Jina embeddings are BERT-based)
Connecting It to RAG
Claude: You built a RAG system. Your embeddings come from Jina (BERT-style). Your generation comes from an LLM (GPT-style). Why do you think we use BERT-style for embeddings and GPT-style for generation, instead of using the same model for both?
Me: For generation, you need to generate the next token. The latest token is always unknown before generation, so you can only know what has come before. For embeddings, you already have the whole text you are trying to embed, so you can read in both directions simultaneously to have context.
Claude: Spot on. You’ve got it.
Generation: the future doesn’t exist yet, so you can only look backward.
Embeddings: the whole text exists, so looking both directions gives you richer understanding.
Positional Encoding
Remember when I said transformers process all words at once? That creates a problem.
If we process all words simultaneously, how does the model know that “cat” comes before “sat”? Without order, “cat the sat” and “the cat sat” look the same.
Solution: Add position information to each word embedding.
"the" embedding + position 1 signal
"cat" embedding + position 2 signal
"sat" embedding + position 3 signal
Now the model knows word order even though it processes everything in parallel.
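One concrete scheme for that "position signal" is the sinusoidal encoding from the original 2017 paper. A sketch (many modern models instead learn their position vectors, or use rotary encodings):

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal position signal: even dimensions use sin, odd use cos,
    at frequencies that get geometrically slower across dimensions."""
    pe = []
    for i in range(d_model):
        freq = 1.0 / (10000 ** ((2 * (i // 2)) / d_model))
        angle = position * freq
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_positions(embeddings):
    """Add each token's position signal to its embedding."""
    return [[e + p for e, p in zip(emb, positional_encoding(pos, len(emb)))]
            for pos, emb in enumerate(embeddings)]
```

The key property: every position gets a distinct vector, so the same word at position 2 and position 5 no longer looks identical to the model.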
Context Window Limits
Claude: Why do transformers have a “context window” limit (like 8K, 32K, 128K tokens)? What about the architecture makes this a constraint?
Me: The context window is the number of tokens it can hold at the same time. I think attention is the constraint.
Claude: Right. Attention is the bottleneck.
Every token attends to every other token. That’s n² comparisons.
1,000 tokens → 1,000,000 attention calculations
10,000 tokens → 100,000,000 attention calculations
100,000 tokens → 10,000,000,000 attention calculations
Memory and compute explode quadratically. That’s why context windows have limits.
Newer models use tricks to extend context (sparse attention, sliding windows, etc.), but the fundamental n² problem is why this is hard.
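The arithmetic above is just n × n. A quick sketch (this counts attention-score computations only; real costs add constant factors for the vector math, plus the memory to hold all those scores):

```python
def attention_scores(n_tokens):
    """Every token attends to every token: n * n score computations."""
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {attention_scores(n):>18,} scores")

# Doubling the context quadruples the work:
assert attention_scores(8_000) == 4 * attention_scores(4_000)
```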
Causal Masking
Claude: Why can’t GPT “see” the word it’s about to predict, even though attention lets tokens look at all other tokens?
Me: Because the token doesn’t exist yet. It can’t see what it hasn’t predicted.
Claude: Exactly. And mechanically, this is enforced with a causal mask - attention scores for future positions are set to negative infinity, so they become zero after softmax. The model literally can’t cheat.
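That masking step can be sketched directly. Toy scores on plain lists; a real model applies the mask to whole score matrices in one tensor operation:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def causal_attention_weights(scores):
    """scores[i][j] = how much token i wants to attend to token j.
    Future positions (j > i) are set to -inf, so softmax gives them
    exactly zero weight - the model literally can't look ahead."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        weights.append(softmax(masked))
    return weights
```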
Summary
You now understand:
Attention - words looking at other words
Self-attention vs cross-attention
Multi-head attention - multiple lenses
Transformer blocks - attention + feed-forward, stacked
GPT (left-to-right) vs BERT (bidirectional)
Positional encoding - how order is preserved
Context window limits - n² attention cost
Interview-ready version:
“Transformers use self-attention to let each token look at every other token in the input, learning which relationships matter. Multi-head attention runs this in parallel to capture different types of patterns. GPT-style models are autoregressive - they predict left-to-right for generation. BERT-style models are bidirectional - they see the whole input for understanding tasks. The n² attention cost is why context windows are limited.”
Quiz
Question 1
Claude: You’re building a system that needs to classify customer support tickets into categories (billing, technical, shipping, etc.). Would you use a GPT-style or BERT-style model for this? Why?
Me: BERT-style.
Claude: Correct. Why?
Me: Bidirectional models are better for classification.
Claude: Right. The whole ticket exists upfront - you need to understand it, not generate new text. Bidirectional gives you richer context for classification.
Question 2
Claude: A transformer has 8 attention heads. One head learns to connect pronouns to their referents (“it” → “cat”). What might another head be learning?
Me: Syntax (subject-verb relationships), coreference (“it” → “cat”), adjective-noun pairs, position/distance.
Claude: Good - you remembered the examples. That’s the right idea: each head specializes in different linguistic patterns.
Question 3
Claude: You double your context window from 4K to 8K tokens. Roughly how much more compute does attention require?
Me: Roughly double.
Claude: Not quite.
Attention is n².
4K tokens → 4K × 4K = 16 million operations
8K tokens → 8K × 8K = 64 million operations
Double the tokens = 4x the compute.
This is why context window expansion is expensive, and why companies charge more for longer contexts.
Question 4
Claude: Your embedding model produces the same vector for these two sentences:
“The bank was steep”
“The steep was bank”
What’s missing from the model?
Me: No positional encoding.
Claude: Correct. Without positional encoding, word order is invisible. Both sentences have the same words, so they’d produce identical embeddings.
Question 5
Claude: In a translation model (English → French), the French decoder needs to look at the English input while generating. Is this self-attention or cross-attention?
Me: This is cross-attention.
Claude: Correct. Two different sequences - English input and French output. The decoder attends across to the encoder’s output.
Question 6
Claude: Why can’t GPT “see” the word it’s about to predict, even though attention lets tokens look at all other tokens?
Me: Because the token doesn’t exist yet. It can’t see what it hasn’t predicted.
Claude: Exactly. And mechanically, this is enforced with a causal mask that sets future attention scores to negative infinity.
Results: 5/6
Missed the n² scaling question, but got it after the explanation.
Solid on:
GPT vs BERT and when to use each
Multi-head attention purpose
Positional encoding
Self vs cross attention
Why GPT can’t see future tokens