Understanding Attention, from MHA to DeepSeek

Generated: 2026-06-23 05:47:57

---

Before we get down to business, let me share a real scene with you—

A couple days ago, I came across an article titled "Understand Transformer in Three Minutes." I clicked it. Three minutes? In three minutes you can't even explain what QKV is, let alone variants like MHA, GQA, and MLA. My blood pressure spiked. 😤

But then again, I can't really blame the writer. Attention is a tough nut to crack. I've been writing technical blogs for ten years, and I've stepped in more pitfalls than I have hair on my head. But lately DeepSeek has been blowing up, especially their MLA—everyone, inside and outside the field, is talking about it. I've been getting more questions about what makes it so good than I can keep up with.

Alright, let me lay it all out once and for all.

The goal is simple: by the time you finish reading, you'll have a clear evolutionary roadmap in your head. You'll know why Attention has been changing since 2017, and what real problem each improvement actually solved. I'll spill all the practical pitfalls I've hit and tuning experience I've gathered. One sentence: "I took the hits so you don't have to."

If you only remember one thing, let it be this: The evolution of Attention is a bloody history of fighting against memory bandwidth and computation cost! 💥

---

1. Let's Start with the Simplest Example: Looking Up a Dictionary

Before we talk about MHA, let's really nail single-head attention. Every variant that came after—multi-head, GQA, MLA, DSA, CSA/HCA—is just a modification of this formula.

1.1 QKV Is Essentially a Retrieval System

The first time I encountered Attention, I looked at those formulas and my head was full of question marks. Later, I realized the best way to understand it is to think of it as looking up a dictionary.

Think about it: "The cat sat on the mat because it was tired." When you read that last "it," your brain automatically decides whether it refers to "cat" or "mat." Break it down into three steps:

For the current position "it," form a Query: "I need to find a singular noun."
For each previous word, create a Key: "I am the singular noun 'cat'" / "I am the plural noun 'cats'" / ...
Based on the match between the Query and each Key, read the Value (semantic information) from that word.

See? That's the core metaphor of Attention: a differentiable dictionary lookup.

Query = "What am I looking for?" — emitted by the current position

Key = "What am I?" — emitted by each historical position

Value = "What information do I store?" — read out after being weighted by the Query×Key match
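To make the dictionary metaphor concrete, here is a minimal sketch of a "soft" lookup in plain Python. Everything in it is made up for illustration: the two hand-written key features, the one-hot value vectors, and the query are hypothetical numbers I picked by hand, not anything a trained model would produce.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Blend all values, weighted by how well each key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Hypothetical hand-written key features: [is a singular noun, can be tired]
words = ["cat", "sat", "mat"]
keys = [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
# Hypothetical value vectors standing in for each word's semantic content
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical query emitted by "it": a singular noun that can be tired
query = [2.0, 2.0]

weights, blended = soft_lookup(query, keys, values)
best = words[weights.index(max(weights))]
print(dict(zip(words, [round(w, 3) for w in weights])), "->", best)
```

Unlike a normal dictionary, nothing is looked up exactly. Every word contributes a little, and "cat" simply gets the largest share. That softness is what makes the whole thing differentiable.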

1.2 The Math? Actually Simple

Multiply a vector by a matrix, get a score, then take a weighted sum. Just three steps.

Let sequence length be L and hidden dimension be d. The input vector at position t is h_t.

Notation:

h_t: input vector at position t, in R^d
q_t: Query vector at position t, in R^{d_k}
k_t: Key vector at position t, in R^{d_k}
v_t: Value vector at position t, in R^{d_v}

Step 1: Calculate similarity scores. Take the dot product of the current Query with Keys from all historical positions:

score(s, t) = q_s^T · k_t

Step 2: Normalize. Softmax turns the scores into a probability distribution:

α_{s,t} = softmax(score(s, t) / √d_k)

Dividing by √d_k is a small trick—it prevents the dot products from getting too large and pushing softmax into the vanishing gradient zone.
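If you want to see the effect of the scaling for yourself, here is a small NumPy sketch. It assumes random standard-normal vectors as stand-ins for real queries and keys, and the sizes (d_k = 512, 16 keys) are arbitrary choices of mine.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k, n_keys = 512, 16                     # arbitrary sizes for the demo
q = rng.standard_normal(d_k)              # stand-in for one query vector
K = rng.standard_normal((n_keys, d_k))    # stand-in for 16 key vectors

raw = K @ q                               # unscaled dot products
scaled = raw / np.sqrt(d_k)               # scaled as in the formula above

print("score std, raw vs scaled:", raw.std(), scaled.std())
print("largest weight, raw vs scaled:", softmax(raw).max(), softmax(scaled).max())
```

With unit-variance inputs, the raw dot products have a spread of roughly √d_k, so the unscaled softmax puts nearly all of its weight on one key. After scaling, the spread is back to about 1 and the distribution stays soft.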

Step 3: Weighted sum. Use the probability distribution to take a weighted sum of the Values:

output_s = Σ_t α_{s,t} · v_t

That's it! That's all of Attention! 😎
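The three steps can be sketched in NumPy roughly like this. It is a minimal sketch, not a production implementation: the function name, the random weights, and the tiny sizes are my own assumptions, and I use a causal mask because the text talks about attending to historical positions.

```python
import numpy as np

def single_head_attention(hidden, W_q, W_k, W_v):
    """hidden: (L, d). Returns the outputs (L, d_v) and the weights (L, L)."""
    Q = hidden @ W_q                      # q_s for every position, (L, d_k)
    K = hidden @ W_k                      # k_t for every position, (L, d_k)
    V = hidden @ W_v                      # v_t for every position, (L, d_v)
    d_k = Q.shape[-1]

    # Step 1: scores[s, t] = q_s^T k_t, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)

    # Causal mask: position s may only look at positions t <= s
    L = hidden.shape[0]
    future = np.triu(np.ones((L, L), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Step 2: softmax over t (each row), shifted by the row max for stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Step 3: weighted sum of the Values
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d, d_k, d_v = 5, 8, 4, 4               # tiny illustrative sizes
hidden = rng.standard_normal((L, d))
W_q = rng.standard_normal((d, d_k))
W_k = rng.standard_normal((d, d_k))
W_v = rng.standard_normal((d, d_v))

out, weights = single_head_attention(hidden, W_q, W_k, W_v)
print(out.shape, weights.round(2))
```

Each row of `weights` is the distribution α_{s,·} for one query position s. The first row has only one allowed position, so the first output is just v_0.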

Only three lines of math, but when you actually implement it, a bunch of problems come up. Let's keep going.

**Pitfall 1 from experience**: d_k shouldn't be too big or too small. A common choice is a per-head dimension of 64 or 128.

**Pitfall 2 from experience**: Be careful with initialization. I've had cases where bad initialization caused the attention distribution to collapse straight into a one-hot vector, and no amount of subsequent tuning could fix it. Talk about infuriating.

---

2. Standard Multi-Head Attention (MHA) — One Head Not Enough? Use Eight!

2.1 Why Do We Need Multi-Head?

Single-head attention has a problem: you're cramming all the information into one Query, one Key, and one Value. It's like asking one person to do translation, proofreading, and layout all at once—they won't do any of them well.

The idea behind multi-head is simple: let the model attend from different perspectives simultaneously.

For a fixed set of attention weights, the output is just a weighted sum of the Values, which is a linear operation. With only one set of QKV projections, every relationship has to share the same projected subspace. With multiple sets (multi-head attention), the model learns several different projections, and each head can focus on a different feature subspace.

In a word: one head can only learn one relationship; multiple heads can learn multiple relationships. That's why it's called Multi-Head Attention.

2.2 How Does It Work Mathematically?

Assume H heads (for example, H=8 or 16).

Each head i has its own projection matrices W_i^Q, W_i^K, W_i^V. The calculation is the same as single-head:

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

Concatenate the outputs of the H heads, and then pass through an output projection:

MHA(Q, K, V) = Concat(head_1, ..., head_H) · W^O
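Here is a rough NumPy sketch of those two formulas for the self-attention case, where Q, K, and V all come from the same input X. The function name, the random weights, the tiny sizes, and the causal mask are assumptions on my part. Real implementations batch the heads into one big matrix multiply instead of looping.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (L, d); W_q, W_k: (H, d, d_k); W_v: (H, d, d_v); W_o: (H*d_v, d)."""
    L = X.shape[0]
    future = np.triu(np.ones((L, L), dtype=bool), k=1)  # True where t > s
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        scores = np.where(future, -np.inf, scores)      # causal mask (assumed)
        heads.append(softmax(scores) @ V)
    # Concat(head_1, ..., head_H) followed by the output projection W^O
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
L, d, n_heads = 6, 16, 4                  # tiny illustrative sizes
d_k = d_v = d // n_heads                  # common choice: split d across heads
X = rng.standard_normal((L, d))
W_q = rng.standard_normal((n_heads, d, d_k))
W_k = rng.standard_normal((n_heads, d, d_k))
W_v = rng.standard_normal((n_heads, d, d_v))
W_o = rng.standard_normal((n_heads * d_v, d))

out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)
```

The output has the same shape as the input, which is what lets you stack these layers on top of each other.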

2.3 My Practical Experience

The biggest pitfall I hit when I first used MHA was deciding on the number of heads.

The original paper by Vaswani et al. used 8 heads. But later I found that this number depends on the model size and the task.

A rule of thumb:

Small models (<1B parameters): 4-8 heads
Medium models (1B-10B): 8-16 heads
Large models (>10B): 16-32 heads

But more heads isn't always better. If you keep the per-head dimension d_k fixed, every additional head means an extra set of QKV projection matrices, so both computation and parameters grow linearly with H. Also, if you have too many heads, some of them will learn the same patterns, which is wasted effort.
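How much an extra head costs depends on what you do with the per-head dimension. A quick back-of-the-envelope sketch, assuming d_k = d_v, no bias terms, and a hypothetical model width of 2048 (the helper name is mine):

```python
def attention_param_count(d_model, n_heads, d_head):
    """Parameters in one attention layer, ignoring biases (assumes d_k == d_v)."""
    qkv = 3 * n_heads * d_model * d_head   # W_i^Q, W_i^K, W_i^V for every head
    out = n_heads * d_head * d_model       # output projection W^O
    return qkv + out

d_model = 2048  # hypothetical model width

# Case 1: keep d_head fixed at 128 and add heads -> parameters grow linearly
fixed_8 = attention_param_count(d_model, 8, 128)
fixed_16 = attention_param_count(d_model, 16, 128)

# Case 2: split d_model across heads (d_head = d_model / H) -> parameters stay flat
split_8 = attention_param_count(d_model, 8, d_model // 8)
split_16 = attention_param_count(d_model, 16, d_model // 16)

print(fixed_8, fixed_16, split_8, split_16)
```

So the linear growth only shows up when the per-head dimension stays fixed. If you split the same width across more heads, the parameter count doesn't move, and any slowdown comes from elsewhere.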

**Pitfall diary**: I ran an experiment where I expanded a 1.3B model from 8 heads to 16. Downstream task metrics barely changed, but training speed dropped by 15%. Then I tried 12 heads: performance was the same as with 16 heads, and speed was only about 5% slower than with 8 heads. Conclusion: start with the rule of thumb, then fine-tune based on experiments. What the paper gives you is just a starting point.

2.4 MHA's Problem: It's Too Expensive!

MHA is powerful, but it's also expensive.

The problem lies in the KV Cache.

During inference in large models, every newly generated token has to attend to the Keys and Values of all previous tokens. Instead of recomputing them at every step, you cache them. That's the KV Cache.

In MHA, each head has its own K and V. With H heads, every token stores H×(d_k + d_v) values per layer, and a sequence of length L stores L times that.

Let's do the math: a model with 80 layers, H=32, d_k=128, d_v=128. Each token stores 32×(128+128)=8,192 values per layer. Across 80 layers, that's 655,360 values per token.
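The same arithmetic as a small Python sketch, extended to bytes. The layer count, head count, and dimensions come from the example above. The 2 bytes per value (FP16) and the 32,768-token context are my own assumptions, added only to turn the count into a memory figure.

```python
def kv_cache_values_per_token(n_layers, n_heads, d_k, d_v):
    """Number of cached K and V values for one token across all layers."""
    return n_layers * n_heads * (d_k + d_v)

per_token = kv_cache_values_per_token(n_layers=80, n_heads=32, d_k=128, d_v=128)

bytes_per_value = 2        # assumption: FP16/BF16 cache
seq_len = 32_768           # assumption: hypothetical context length

bytes_per_token = per_token * bytes_per_value
total_gib = bytes_per_token * seq_len / 2**30

print(per_token)           # 655360 values per token
print(bytes_per_token)     # 1310720 bytes, i.e. 1.25 MiB per token
print(total_gib)           # 40.0 GiB for a single sequence
```

Under those assumptions, one sequence eats 40 GiB of cache before you count the model weights at all.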


Cael Lee
