Attention from First Principles - 1

The foundations - self attention & scaled dot product attention.

Jan 22, 2026

Understanding how transformers work starts with one question: How does a word know what it means? The word “bank” in “river bank” means something different than “bank” in “I need a loan from the bank.” Static word embeddings can’t capture this — they’re frozen, context-blind.

Attention solves this by letting each word dynamically gather information from its surroundings. It’s the mechanism that makes modern language models work, from BERT to GPT to the latest frontier models.

In this series of posts, we’ll build attention from scratch — starting with the simplest possible approach and evolving through each innovation: scaled dot-product attention, multi-head attention, causal masking, and finally the efficiency optimizations (GQA and MLA) that make long-context models feasible.

No hand-waving. Just the math, the intuition, and working PyTorch code.

Here’s what this article covers:

The Context Problem: Why static embeddings fail
The Naive Solution: Fixed linear combinations of neighbors
Making It Dynamic: Dot products as similarity
Understanding the Roles of X₁ and Xᵢ in the Dot Product
Adding Flexibility: Learnable Projections (Q, K, V)
Self-Attention Complete: Putting It All Together
Scaled Dot-Product Attention

The Context Problem

Consider this sentence:

“The bank approved my loan application.”

Now this one:

“I sat by the river bank watching the sunset.”

Same word. Completely different meanings.

If we represent “bank” as a single static vector — say, a 300-dimensional embedding from Word2Vec or GloVe — we have a problem. That vector can’t simultaneously capture “financial institution” and “edge of a river.” It’s frozen, context-blind.

The question: How do we make a word’s representation depend on what surrounds it?

Context determines meaning in nearly every sentence we write. “Apple” near “iPhone” means something different than “apple” near “pie.” “Pitch” changes meaning depending on whether we’re talking about baseball, music, or startups.

We need a mechanism that looks at a word’s neighbors and adjusts its representation accordingly. Not just once, but dynamically — different contexts should produce different representations.

Let’s formalize the problem:

You have a sentence represented as vectors: X₁, X₂, X₃, X₄

Goal: Update X₁ to X’₁, a context-aware version of the same word.

Why? Because X’₁ should capture what surrounds it. In “river bank,” X’₁ should lean toward geographical meaning. In “money bank,” it should lean financial.

The mechanism we’re about to build — attention — is the answer to this problem. But we’ll get there step by step, starting with the simplest possible approach.

The Naive Solution: Fixed Linear Combination

The simplest idea: mix X₁ with its neighbors.

X’₁ = a₁X₁ + a₂X₂ + a₃X₃ + a₄X₄

We’re creating a weighted average. Each neighbor contributes to the new representation based on some importance weight aᵢ.

Why this makes sense:

If X₃ contains relevant context (say, “river”), we want X₁ to absorb some of that information
The weights aᵢ control how much each token influences X₁
X’₁ becomes a blend — no longer just “bank,” but “bank in this specific context”

But where do these weights come from?

We could hand-craft them. Maybe nearby words get higher weights (a₂ = 0.4, a₃ = 0.3) and distant ones get lower weights (a₄ = 0.1). Or we could make all weights equal (uniform averaging).

The problem: Fixed weights can’t adapt.

In “bank approved loan,” we want high weight on “loan” and “approved.” In “river bank,” we want high weight on “river.” A single set of fixed weights can’t handle both cases.
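To make the fixed-weight recipe concrete, here is a minimal sketch. The embedding values are made up for illustration, and a₁ = 0.2 is an assumption chosen only so that the weights sum to 1 (the text above gives a₂, a₃, and a₄ but not a₁).

```python
import torch

# Hypothetical 3-dimensional embeddings for a 4-token sentence (values made up).
X = torch.tensor([
    [1.0, 0.0, 0.0],   # X1: "bank"
    [0.0, 1.0, 0.0],   # X2
    [0.0, 0.0, 1.0],   # X3
    [1.0, 1.0, 0.0],   # X4
])

# Hand-crafted weights. a2..a4 follow the example in the text;
# a1 = 0.2 is an assumption so that the weights sum to 1.
a = torch.tensor([0.2, 0.4, 0.3, 0.1])

# X'1 = a1*X1 + a2*X2 + a3*X3 + a4*X4
x1_new = a @ X
print(x1_new)
```

Notice that `a` never looks at `X`. Swap in a completely different sentence and the same weights get applied, which is exactly the rigidity described above.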

Mental Model:

Fixed weights = a rigid recipe. We need weights that change based on what’s in the sentence.

This brings us to the key insight: the weights themselves need to be computed dynamically.

Making It Dynamic: The Dot Product Insight

How do we compute weights that adapt to context?

Key idea: Use similarity.

If X₃ is semantically similar to X₁, it probably contains relevant context. If it’s unrelated, we should ignore it.

How do we measure similarity between vectors?

The dot product: X₁ · Xᵢ

When two vectors point in similar directions (share semantic features), their dot product is high. When they’re orthogonal (unrelated), it’s near zero.

So instead of fixed weights, we compute:

aᵢ = X₁ · Xᵢ    (for i = 1 to 4)

Now our context-aware representation becomes:

X’₁ = (X₁ · X₁)X₁ + (X₁ · X₂)X₂ + (X₁ · X₃)X₃ + (X₁ · X₄)X₄

Why this works:

If X₁ · X₃ is high, then X₃ probably shares important context with X₁
The weight is computed automatically based on the actual content of the vectors
Different sentences produce different weights — exactly what we wanted
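A quick sketch of the same idea in PyTorch. The vectors are toy values picked so that X₃ points in a similar direction to X₁ while X₂ and X₄ are orthogonal to it; nothing here comes from a real embedding model.

```python
import torch

X = torch.tensor([
    [1.0, 0.0, 1.0],   # X1
    [0.0, 1.0, 0.0],   # X2: orthogonal to X1
    [1.0, 0.0, 0.5],   # X3: points in a similar direction to X1
    [0.0, 0.5, 0.0],   # X4: orthogonal to X1
])

x1 = X[0]
a = X @ x1        # a_i = X1 · X_i for every i, shape (4,)
x1_new = a @ X    # X'1 = sum_i a_i * X_i

print(a)          # weights 2.0, 0.0, 1.5, 0.0
print(x1_new)
```

The weights are now a function of the vectors themselves: X₂ and X₄ contribute nothing, while X₃ gets pulled in. Note that these raw weights are not normalized, so they do not form an average yet.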

Mental Model:

The dot product measures alignment. High alignment = high relevance = high weight.

This is early self-attention intuition: attend more to similar tokens.

But there’s still a limitation: Raw dot products might not be flexible enough. Word embeddings are fixed — what if the kind of similarity we need for attention isn’t the same as the similarity baked into the embeddings?

Understanding the Roles of X₁ and Xᵢ in the Dot Product

Look closely at what’s happening in our similarity computation:

aᵢ = X₁ · Xᵢ

X₁ and Xᵢ are playing different roles here:

X₁ is asking: “What am I looking for? What context do I need?”
Xᵢ is answering: “Here’s what I have to offer as context.”

This is asymmetric. X₁ is the query — the token seeking information. Xᵢ is the key — advertising what information it contains.

Think about the word “bank” in our sentence again. When we compute:

a₃ = X₁ · X₃

X₁ (the “bank” vector) is doing two jobs:

As a query (asking): “I’m ‘bank’ — do you, X₃, have context that helps clarify my meaning?”
As a key (offering): When X₃ wants to update itself, it computes X₃ · X₁, and now X₁ is saying “Here’s what I can offer you as context.”

Same vector. Two different conversations.

The problem: Maybe the features that make “bank” a good searcher (what helps it find relevant context) are different from the features that make it a good context provider (what it offers to others).

It’s like using the same resume when you’re hiring someone vs. when you’re applying for a job. The information you care about is different in each case.
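One way to see the "same vector, two conversations" issue numerically: with raw embeddings, the score token i gives token j is always identical to the score token j gives token i. The sketch below uses random stand-in embeddings with made-up sizes.

```python
import torch

torch.manual_seed(0)       # arbitrary seed, only for repeatability
X = torch.randn(4, 8)      # 4 tokens, 8-dim embeddings (made-up sizes)

scores = X @ X.T           # scores[i, j] = X_i · X_j

# The raw score matrix is symmetric: token i "asks" token j with exactly
# the same number that token j uses to "ask" token i.
print(torch.allclose(scores, scores.T))
```

So as long as we score with the raw vectors, a token cannot be a strong match as a provider of context without being an equally strong match as a seeker of it.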

And there’s a third role we haven’t separated yet:

Once we decide Xᵢ is relevant (high weight), what information do we actually extract from it? Right now, we’re taking the entire vector Xᵢ. But maybe we only need certain aspects of it — the value it provides.

Mental Model:

Query = “What do I need?”
Key = “What do I have?”
Value = “Here’s what I’ll give you.”

Currently, all three roles are crammed into the same vector. What if we let the model learn separate representations for each role?

Adding Flexibility: Learnable Projections (Q, K, V)

Raw dot products have a fundamental limitation: they measure similarity in the original embedding space.

But what if the kind of similarity we need for attention is different from the similarity already encoded in word embeddings?

Example:

Word embeddings might group “king” and “queen” as similar (both royalty)
But for attention in a specific context, we might care about grammatical role (subject vs. object) or semantic function (agent vs. patient)
The embedding space wasn’t designed for this specific task

The insight: Project into a learned space where similarity means what we need it to mean.

Instead of using raw vectors, we introduce three learnable transformations:

Q₁ = WQ X₁    (Query: what token 1 is looking for)
Kᵢ = WK Xᵢ    (Key: what token i offers)
Vᵢ = WV Xᵢ    (Value: the actual content to mix in)

Why three separate projections?

Query (Q): “What am I looking for?” — represents the information needs of the current token
Key (K): “What do I offer?” — represents what each token can provide
Value (V): “What content do I actually contribute?” — the information that gets mixed

The separation is crucial:

Similarity (Q · K) is computed in one space (query-key space)
Content mixing (weighted sum of V) happens with potentially different representations
This decoupling gives the model flexibility to learn: “attend based on X, but retrieve based on Y”
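Here is a minimal sketch of the three projections. The matrices are random stand-ins for weights that would normally be learned, and the sizes are made up. One convention note: the formulas above treat each token as a column vector (Q₁ = WQ X₁), while the code stores tokens as rows, so the multiplication is written `X @ W_Q`.

```python
import torch

torch.manual_seed(0)
d_model, d_k = 8, 4            # made-up sizes
X = torch.randn(4, d_model)    # 4 tokens, one row per token

# Random stand-ins for the learned projection matrices.
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q = X @ W_Q    # row i: what token i is looking for
K = X @ W_K    # row i: what token i offers
V = X @ W_V    # row i: the content token i contributes

scores = Q @ K.T    # scores[i, j] = Q_i · K_j
print(Q.shape, K.shape, V.shape, scores.shape)
print(torch.allclose(scores, scores.T))   # no longer symmetric in general
```

Because queries and keys come from different projections, the score from token i to token j can now differ from the score from token j to token i.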

Now instead of X₁ · Xᵢ, we use:

AttentionScore₁,ᵢ = Q₁ · Kᵢ

Why projections help:

WQ, WK, WV are learned during training
The model discovers what kind of similarity matters for the task
For syntax tasks, it might learn to project grammatical features
For semantic tasks, it might learn meaning-based projections
The same embeddings can be projected differently depending on what the attention needs to capture

Insight: Projections create a custom space where “similar” means “relevant for this attention operation,” not just “semantically close in general.”

Why not just use the original vectors X for mixing?

Think about it this way: the attention mechanism needs to solve two different problems:

Finding relevance (which tokens to pay attention to)
Extracting useful information (what to actually take from those tokens)

These aren’t necessarily the same thing!

Example scenario:

For finding relevance, you might want to emphasize syntactic features (“is this a verb? a noun?”)
But for the actual content you mix in, you might want semantic meaning (“what does this word contribute to the sentence meaning?”)

The V projection lets the model learn: “Here’s the useful representation to pass forward,” which might be different from both:

The raw embedding X
The key representation K (used for matching)

Concrete analogy: Imagine a library search:

Query (Q): “I’m looking for books about neural networks”
Key (K): Book titles and tags (used to match your query)
Value (V): The actual book content you get when there’s a match

The title/tags (K) help you find relevant books, but the content (V) is what you actually read. They serve different purposes.
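The library analogy can be written as a "soft lookup". In this toy sketch every number is invented: three books have 2-dimensional keys (their tags) and 3-dimensional values (their content), and the softmax step is borrowed from the complete formula later in the article.

```python
import torch
import torch.nn.functional as F

# Hypothetical library: 3 books, each with a key (tags) and a value (content).
K = torch.tensor([[4.0, 0.0],    # book 0: strongly about topic A
                  [0.0, 4.0],    # book 1: strongly about topic B
                  [2.0, 2.0]])   # book 2: a bit of both
V = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

q = torch.tensor([2.0, 0.0])     # query: "I want topic A"

weights = F.softmax(K @ q, dim=0)   # match the query against every key
result = weights @ V                # blend the contents by match quality

print(weights)   # almost all weight lands on book 0
print(result)
```

Unlike a dictionary lookup, nothing is retrieved exclusively: book 2 partially matches, so a little of its content is mixed into the result as well.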

               ┌───────────────────────────┐
               │   Input Embeddings (X)    │
               │ (general token features)  │
               └─────────────┬─────────────┘
                             │
          ┌──────────────────┴──────────────────┐
          │                                     │
     Project to Q                          Project to K
(“What am I looking for?”)        (“What features identify relevance?”)
          │                                     │
          └──────────────┬─────────────────────┘
                         │  similarity scores
                         ▼
                 Attention Weights
                         │
                         ▼
                   Project to V
    (“What useful meaning should I extract and mix?”)
                         │
                         ▼
                     Output Mix
     (Weighted combination of useful info only)

Analogy — Searching a library

User Query (Q): “Books about neural networks”
         │
         ▼
 Match against:
   Book Titles & Tags (K)
  ────────────────────────
  - “Deep Learning”
  - “Neural Networks”
  - “Neuroscience”
  - “Cooking with Neural Nets” 😄
  ────────────────────────
         │  (attention scores)
         ▼
 Retrieve:
   Actual Book Content (V)
  ────────────────────────
  - chapters
  - explanations
  - diagrams
  ────────────────────────
         │
         ▼
 You read the books! → Final output representation

Self-Attention Complete: Putting It All Together

Now we have all the pieces. Let’s assemble the full self-attention mechanism.

The complete formula:

For each token i, we compute its context-aware representation:

X’ᵢ = Σⱼ softmax(Qᵢ · Kⱼ) · Vⱼ

Why Softmax?

Raw dot products Qᵢ · Kⱼ can be any value—positive, negative, large, small. We need to convert them into proper attention weights that:

Are all positive (no negative attention)
Sum to 1 (proper probability distribution)
Emphasize high scores and suppress low ones

w₁,ᵢ = softmax(Q₁ · Kᵢ) = exp(Q₁ · Kᵢ) / Σⱼ exp(Q₁ · Kⱼ)

What softmax does:

Converts scores into probabilities (0 to 1, sum to 1)
Creates a sharp focus: high scores get most of the weight
Stays smooth and differentiable, so the weights can be learned with gradient descent
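To check these properties, here is a small hand-written softmax compared against PyTorch's built-in one. The helper name `softmax_manual` and the scores are made up for this sketch; subtracting the maximum first is a common numerical-stability trick that does not change the result.

```python
import torch
import torch.nn.functional as F

def softmax_manual(scores: torch.Tensor) -> torch.Tensor:
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    shifted = scores - scores.max()
    exp = torch.exp(shifted)
    return exp / exp.sum()

scores = torch.tensor([3.0, -1.0, 0.5, -2.0])   # made-up raw scores, some negative
w = softmax_manual(scores)

print(w)                                            # all entries positive
print(w.sum())                                      # sums to 1
print(torch.allclose(w, F.softmax(scores, dim=0)))  # matches the built-in
```

Even the negative scores end up with small positive weights, and the largest score takes the biggest share.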

Mental Model:

Softmax turns attention scores into a weighted average recipe — ensuring we’re mixing values in a balanced, interpretable way.

The Complete Self-Attention Step

For token i:
1. Compute query: Qᵢ = WQ Xᵢ
2. Compute keys for all tokens: Kⱼ = WK Xⱼ (for all j)
3. Compute values for all tokens: Vⱼ = WV Xⱼ (for all j)
4. Calculate attention scores: scoreᵢ,ⱼ = Qᵢ · Kⱼ
5. Normalize with softmax: wᵢ,ⱼ = softmax(scoreᵢ,ⱼ)
6. Mix values: X’ᵢ = Σⱼ wᵢ,ⱼ Vⱼ
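The six steps translate almost line by line into code. This is a deliberately unoptimized sketch for a single token with no batch dimension; the function name `attend_one_token`, the sizes, and the random matrices are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def attend_one_token(i, X, W_Q, W_K, W_V):
    """Context-aware vector for token i. X has shape (seq_len, d_model)."""
    q_i = X[i] @ W_Q                 # step 1: query for token i
    K = X @ W_K                      # step 2: keys for all tokens
    V = X @ W_V                      # step 3: values for all tokens
    scores = K @ q_i                 # step 4: score_ij = Q_i · K_j, shape (seq_len,)
    w = F.softmax(scores, dim=0)     # step 5: normalize over j
    return w @ V, w                  # step 6: weighted sum of values

torch.manual_seed(0)
X = torch.randn(4, 8)                                   # 4 tokens, d_model = 8
W_Q, W_K, W_V = (torch.randn(8, 8) for _ in range(3))   # random stand-ins

x1_new, w = attend_one_token(0, X, W_Q, W_K, W_V)
print(w, w.sum())      # one weight per token, summing to 1
print(x1_new.shape)    # same size as a value vector
```

Calling the function once per token would produce the whole updated sequence, although in practice this loop is replaced by matrix operations.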

Key insight: Every token becomes both a query (looking for context) and a key/value pair (offering context to others). This is why it’s called self-attention — tokens attend to each other within the same sequence.

In the next section, we’ll look more closely at these equations to understand how Q, K, and V interact.

Understanding the Symmetries: Q, K, V Relationships

The roles of Q, K, and V have interesting symmetries that reveal how attention works:

Q and K: Symmetric operation, asymmetric roles

The attention score computation Qᵢ · Kⱼ is symmetric in form—it’s just a dot product. But the roles are asymmetric:

Qᵢ represents a single token asking: “What context do I need?”
K represents the entire sequence answering: “Here’s what we all offer”

When computing attention for token i, we have:

One query vector: Qᵢ (the current token’s question)
Multiple key vectors: K₁, K₂, …, Kₙ (every token’s advertisement)

The dot product Qᵢ · Kⱼ measures: “How well does token j’s offering match token i’s needs?”

V: Asymmetric to Q, symmetric to K

V plays a fundamentally different role than Q:

Q determines what to look for (the search)
K determines what matches (the filtering)
V determines what information flows forward (the payload)

But V shares structure with K:

Both K and V represent the entire context (all tokens)
K is used for matching; V is used for content
They’re computed from the same input tokens, just with different projections

Mental Model:

Q is the question (one token). K and V are the answers (all tokens). K helps find relevant answers. V provides the actual content from those answers.

Why this matters: When you see attention_weights @ V, remember:

The weights came from Q·K (matching the question to offerings)
But we’re mixing V (the actual content), not K
This decoupling lets the model learn: “match based on X, retrieve based on Y”
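As a sanity check on what `attention_weights @ V` does, the sketch below writes out one output row as an explicit weighted sum. Q, K, and V are random stand-ins for already-projected tensors, with made-up sizes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = torch.randn(3, 4)   # stand-ins for already-projected queries, keys, values
K = torch.randn(3, 4)
V = torch.randn(3, 4)

attention_weights = F.softmax(Q @ K.T, dim=-1)   # weights come from Q and K only
output = attention_weights @ V                   # content comes from V only

# Row 0 of the output, written as an explicit weighted sum over V's rows.
row0 = sum(attention_weights[0, j] * V[j] for j in range(3))
print(torch.allclose(output[0], row0, atol=1e-6))
```

K appears only inside the softmax and V only outside it, which is the decoupling described above.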

Efficiency insight: Notice that K and V depend only on the context tokens, not on which token is querying. During autoregressive generation (like in GPT), past tokens’ K and V never change — only new queries arrive. This observation leads to KV caching: we compute and store K, V once for past tokens, then reuse them for each new token. We’ll explore this optimization when we discuss inference efficiency.
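Here is a rough sketch of that observation, not a production KV cache. It assumes the new token is the last one in the sequence, uses random stand-in weights with made-up sizes, and leaves out masking and batching. The check is that attending with cached K and V gives the same result as recomputing everything from scratch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_k = 8, 4
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))

past = torch.randn(3, d_model)   # 3 tokens already processed
new = torch.randn(1, d_model)    # 1 newly generated token

# Cache: keys and values of past tokens, computed once.
K_cache, V_cache = past @ W_K, past @ W_V

# New step: project only the new token, then append it to the cache.
q_new = new @ W_Q
K_all = torch.cat([K_cache, new @ W_K], dim=0)
V_all = torch.cat([V_cache, new @ W_V], dim=0)
out_cached = F.softmax(q_new @ K_all.T, dim=-1) @ V_all

# Reference: recompute Q, K, V for the full sequence and take the last row.
X = torch.cat([past, new], dim=0)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
out_full = (F.softmax(Q @ K.T, dim=-1) @ V)[-1:]

print(torch.allclose(out_cached, out_full, atol=1e-5))
```

The cached path projects one token instead of four, and the saving grows with the length of the sequence.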

Scaled Dot-Product Attention

Before we move to the complete implementation, there’s one critical detail: scaling.

When you compute dot products between Q and K, the magnitude of the result grows with dimensionality.

Specifically, if the elements of Q and K are independent with zero mean and unit variance, then a dot product Q · K over d_k dimensions has mean 0 and variance d_k, so its typical magnitude grows like √d_k.
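For readers who want the missing step, here is a short derivation. It assumes the components q_m and k_m are mutually independent with zero mean and unit variance, which is an idealization rather than something guaranteed for real projections.

```latex
\begin{aligned}
q \cdot k &= \sum_{m=1}^{d_k} q_m k_m, \\
\mathbb{E}[q_m k_m] &= \mathbb{E}[q_m]\,\mathbb{E}[k_m] = 0, \\
\operatorname{Var}(q_m k_m) &= \mathbb{E}[q_m^2]\,\mathbb{E}[k_m^2] - 0^2 = 1 \cdot 1 = 1, \\
\operatorname{Var}(q \cdot k) &= \sum_{m=1}^{d_k} \operatorname{Var}(q_m k_m) = d_k .
\end{aligned}
```

The last line uses independence across m, so the variances of the d_k terms simply add up. The standard deviation of the dot product is therefore √d_k.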

The problem:

Large dot products → large inputs to softmax
Softmax with large inputs becomes extremely “sharp” (almost all weight on the max value)
This pushes the function into saturation regions with tiny gradients
Training becomes difficult or unstable
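The "tiny gradients" claim can be checked directly by looking at the Jacobian of softmax for small and large inputs. The sketch below uses `torch.autograd.functional.jacobian`; the score values and the helper name `softmax_fn` are picked only for illustration.

```python
import torch
import torch.nn.functional as F
from torch.autograd.functional import jacobian

def softmax_fn(s):
    return F.softmax(s, dim=0)

small = torch.tensor([2.0, 1.0, 0.5])
large = small * 10               # same ordering, 10x the magnitude

# jacobian(...)[i, j] = d softmax(s)_i / d s_j
J_small = jacobian(softmax_fn, small)
J_large = jacobian(softmax_fn, large)

print(J_small.abs().max())   # on the order of 0.1
print(J_large.abs().max())   # several orders of magnitude smaller
```

When the largest entry of the Jacobian is that small, changes to the scores barely move the output, so very little learning signal flows back through the softmax.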

The solution: Scale by 1/√d_k

scores = (Q · K) / √d_k

This normalizes the variance back to ~1, keeping softmax in a reasonable operating range.
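We can check the variance claim empirically. This sketch assumes the idealized setting of independent standard-normal components, and the sample size is arbitrary.

```python
import torch

torch.manual_seed(0)
d_k, n = 64, 100_000             # sample size n is arbitrary

q = torch.randn(n, d_k)          # zero-mean, unit-variance components
k = torch.randn(n, d_k)

dots = (q * k).sum(dim=-1)       # n independent dot products
scaled = dots / d_k ** 0.5

print(dots.var())      # close to d_k = 64
print(scaled.var())    # close to 1
```

Dividing by √d_k divides the variance by d_k, which is why the scaled scores land back near unit variance whatever the dimension.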

Reference: This comes from the original Transformer paper “Attention Is All You Need” (Vaswani et al., 2017), Section 3.2.1:

“We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.”

Let us look at it with concrete numbers.

Example 1: Normal-sized scores (well-behaved softmax)

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5])
weights = F.softmax(scores, dim=0)

print(weights)
# Output: [0.629, 0.231, 0.140]

Nice distribution — the highest score gets most weight, but others still contribute.

Example 2: Large scores (sharp/saturated softmax)

scores = torch.tensor([20.0, 10.0, 5.0])  # 10x larger
weights = F.softmax(scores, dim=0)
print(weights)
# Output: [0.99995, 0.00005, 0.00000]

Almost all weight goes to the max! The other tokens are essentially ignored.

Example 3: Effect of scaling

# Without scaling (d_k = 64)
large_scores = torch.tensor([50.0, 45.0, 40.0])
weights_no_scale = F.softmax(large_scores, dim=0)
print("No scaling:", weights_no_scale)
# Output: [0.993, 0.007, 0.000]

# With scaling
d_k = 64
scaled_scores = large_scores / (d_k ** 0.5)  # divide by 8
weights_scaled = F.softmax(scaled_scores, dim=0)
print("With scaling:", weights_scaled)
# Output: [0.549, 0.294, 0.157]

Scaling brings back a reasonable distribution!

Complete Scaled Dot-Product Attention Steps

For token i:
1. Compute query: Qᵢ = WQ Xᵢ
2. Compute keys for all tokens: Kⱼ = WK Xⱼ (for all j)
3. Compute values for all tokens: Vⱼ = WV Xⱼ (for all j)
4. Calculate scaled attention scores: scoreᵢ,ⱼ = (Qᵢ · Kⱼ) / √dₖ
5. Normalize with softmax: wᵢ,ⱼ = softmax(scoreᵢ,ⱼ)
6. Mix values: X’ᵢ = Σⱼ wᵢ,ⱼ Vⱼ

Here’s the PyTorch implementation:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    """
    X: (batch, seq_len, d_model) - input embeddings
    W_Q, W_K, W_V: (d_model, d_k) - projection matrices
    Returns: (output, attention_weights)
        output: (batch, seq_len, d_k) - context-aware representations
        attention_weights: (batch, seq_len, seq_len)
    """
    # Project to Q, K, V
    Q = X @ W_Q  # (batch, seq_len, d_k)
    K = X @ W_K  # (batch, seq_len, d_k)
    V = X @ W_V  # (batch, seq_len, d_k)
    
    # Compute attention scores with scaling
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)  # (batch, seq_len, seq_len)
    
    # Apply softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)
    
    # Weighted sum of values
    output = attention_weights @ V  # (batch, seq_len, d_k)
    
    return output, attention_weights

Let’s run it and see the results:

# Toy example: one sequence of 3 tokens with 4-dimensional one-hot embeddings
X = torch.tensor([
    [[1.0, 0.0, 0.0, 0.0],   # token 1
     [0.0, 1.0, 0.0, 0.0],   # token 2  
     [0.0, 0.0, 1.0, 0.0]]   # token 3
])

# Projection matrices maintain dimension (in practice, these are learned)
W_Q = torch.randn(4, 4)
W_K = torch.randn(4, 4)
W_V = torch.randn(4, 4)

output, weights = scaled_dot_product_attention(X, W_Q, W_K, W_V)

print("Attention weights:\n", weights)
print("\nOutput shape:", output.shape)
print("\nOutput:\n", output)

Output (W_Q, W_K, and W_V are random, so the exact numbers will differ from run to run):

Attention weights:
 tensor([[[0.2101, 0.0854, 0.7045],
         [0.7229, 0.1472, 0.1298],
         [0.3303, 0.3309, 0.3388]]])

Output shape: torch.Size([1, 3, 4])

Output:
 tensor([[[ 0.3942, -1.1294, -0.3138, -0.0305],
         [ 0.2267, -0.9491,  0.4525,  0.5159],
         [ 0.1883, -1.1199, -0.0303,  0.4311]]])

Understanding the Results

Attention weights:

[[0.2101, 0.0854, 0.7045],   # Token 1’s attention distribution
 [0.7229, 0.1472, 0.1298],   # Token 2’s attention distribution
 [0.3303, 0.3309, 0.3388]]   # Token 3’s attention distribution

Reading this matrix:

Token 1: Attends mostly to Token 3 (70.5%), less to itself (21%) and Token 2 (8.5%)
Token 2: Attends mostly to Token 1 (72.3%), distributes rest between itself and Token 3
Token 3: Fairly balanced attention across all three tokens (~33% each)
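The same reading can be done programmatically. The tensor below is the weight matrix printed above, copied by hand, so it reflects that one particular run and not what a fresh run would produce.

```python
import torch

# The attention weights printed above, copied by hand.
weights = torch.tensor([
    [0.2101, 0.0854, 0.7045],
    [0.7229, 0.1472, 0.1298],
    [0.3303, 0.3309, 0.3388],
])

print(weights.sum(dim=-1))      # each row sums to ~1 (up to rounding)
print(weights.argmax(dim=-1))   # indices 2, 0, 2 -> tokens 3, 1, 3
```

Each row is one token's distribution over the sequence. The indices from `argmax` are 0-based, so index 2 corresponds to Token 3.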

Output:

Output shape: torch.Size([1, 3, 4])
Output:
 [[[ 0.3942, -1.1294, -0.3138, -0.0305],   # Token 1’s context-aware representation
   [ 0.2267, -0.9491,  0.4525,  0.5159],   # Token 2’s context-aware representation
   [ 0.1883, -1.1199, -0.0303,  0.4311]]]  # Token 3’s context-aware representation

Key observations:

Shape is (1, 3, 4) = (batch, seq_len, d_model) - dimension preserved
Each token started as a 4D one-hot vector
After projection and attention, each is now a 4D context-aware representation
Token 1’s output = 0.21×V₁ + 0.09×V₂ + 0.70×V₃ (weighted by attention)

What this means: Token 1’s new representation [0.39, -1.13, -0.31, -0.03] is no longer just “token 1 in isolation”—it’s enriched with information from Token 3 (which received 70% of the attention weight). The original identity is blended with contextual information.

Parallel Computation

This example also highlights a difference between how we wrote the attention equations and how we compute them in practice. The equations describe one token at a time, but the code computes attention for all tokens simultaneously.

Notice the matrix operations in the code:

scores = Q @ K.transpose(-2, -1)  # (batch, seq_len, seq_len)

This single matrix multiplication computes attention scores for every token pair at once:

Q has shape (batch, seq_len, d_k) — all queries together
K^T has shape (batch, d_k, seq_len) — all keys together
Result: (batch, seq_len, seq_len) — every query against every key

Mental Model:

We don’t loop through tokens. One matrix multiplication gives us the entire attention pattern for the whole sequence.
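To convince ourselves that the single matrix multiplication really is the same computation, here is a small sketch comparing it with an explicit double loop over token pairs. Q and K are random stand-ins with made-up sizes, and the batch dimension is dropped for brevity.

```python
import torch

torch.manual_seed(0)
seq_len, d_k = 5, 4
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)

# One token pair at a time.
scores_loop = torch.empty(seq_len, seq_len)
for i in range(seq_len):
    for j in range(seq_len):
        scores_loop[i, j] = torch.dot(Q[i], K[j])

# All pairs at once.
scores_matmul = Q @ K.T

print(torch.allclose(scores_loop, scores_matmul, atol=1e-6))
```

Both versions do the same arithmetic, but the matrix form hands the whole job to one optimized kernel instead of seq_len² separate Python-level calls.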

This parallelization is why transformers are so efficient on GPUs — unlike RNNs that process sequentially, attention computes all relationships simultaneously.

This concludes the first part of this multi-part series exploring attention from the fundamentals in detail. I hope you liked it.

In our next part of this series we will cover the following:

Computational Complexity of Self-Attention
Multi-Head Attention: The Power of Multiple Perspectives
The Multi-Head Illusion: Where Does Separation Actually Happen?
Causal Self-Attention: Looking Only Backward

Stay Tuned!!
