Transformer & BERT Related Questions: A Retrospective (English)
Generated: 2026-06-20 18:39:41
---
Have you ever had that kind of interview? After three months of fall recruitment, I was so sick of answering Transformer and BERT trivia that I could've thrown up. And it wasn't just me—my labmates grinding through the same questions were all complaining too: Why does this stuff just get deeper and deeper? The answers you find online are the same few lines repeated over and over, but when you actually try to explain it yourself, you freeze up on the spot.
So today I'll skip the fluff. I'll break down the traps I fell into, the experiments I ran myself, and those hair-raising follow-up questions from interviewers, peeling them back layer by layer like an onion. By the time you finish reading, you'll see that those "frequently asked questions" all share the same underlying logic.
---
Positional Encoding: The Interviewer Hit Me with Three Questions in a Row, and I Completely Crashed
ByteDance's first round. The interviewer fired off three questions, and I was drenched in sweat.
- Why does Transformer need positional encoding?
- Why use sinusoidal functions instead of learnable ones?
- Self-Attention loses relative position information, so what's the point of adding it anyway?
I could barely handle the first two, but the third one completely stumped me—I only remembered that "position information disappears in Attention," but I'd never thought about "why bother adding it at all if it disappears."
First, why it's needed.
Transformer has no recurrent structure, and Self-Attention on its own is order-blind: shuffle the input tokens and each token's output stays the same, just shuffled along with it. For inputs like "I hit you" and "you hit me," a model without positional encoding would just see a bag of three words. It can't tell whether "hit" is the second word or the third. The sinusoidal formula gives each position a unique "fingerprint" built from sines and cosines at different frequencies, so the model knows that order exists.
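To make the "bag of words" point concrete, here is a minimal NumPy sketch, not taken from any real model. The token embeddings and projection weights are random placeholders, and "I"/"me" are treated as one token for simplicity. It checks that plain Self-Attention gives the same per-token outputs when the input is reordered, and that adding the sinusoidal encoding breaks that symmetry.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    positions = np.arange(num_positions)[:, None]       # (P, 1)
    pair_index = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angles = positions / (10000.0 ** (2 * pair_index / d_model))
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d_model = 16
# Hypothetical random embeddings standing in for "I", "hit", "you"
tokens = rng.normal(size=(3, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
reversed_order = [2, 1, 0]  # "you hit I": same tokens, reversed

# Without positional encoding: reordering the input only reorders the output.
out_plain = self_attention(tokens, w_q, w_k, w_v)
out_plain_rev = self_attention(tokens[reversed_order], w_q, w_k, w_v)
same_without_pe = np.allclose(out_plain[reversed_order], out_plain_rev)

# With positional encoding: each token's output now depends on where it sits.
pe = sinusoidal_encoding(3, d_model)
out_pe = self_attention(tokens + pe, w_q, w_k, w_v)
out_pe_rev = self_attention(tokens[reversed_order] + pe, w_q, w_k, w_v)
same_with_pe = np.allclose(out_pe[reversed_order], out_pe_rev)

print("order-blind without PE:", same_without_pe)
print("order-blind with PE:", same_with_pe)
```

Note that "hit" sits in the middle in both orderings, yet its output still changes once positions are added, because the tokens it attends to now carry different fingerprints.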
So why not use learnable embeddings?
I dug into a lot of blog posts on this later and even ran my own comparison experiment, replacing the sinusoidal encoding with learnable embeddings. On WMT English-German, the BLEU score difference was less than 0.3, but convergence was noticeably slower. The biggest advantage of sinusoidal encoding is that it requires no extra parameters, and in theory it can even encode relative positions: for any fixed offset k, the encoding at position pos + k is a linear function of the encoding at pos. But after passing through Self-Attention's weighting, that positional information does get diluted, like ink poured into the ocean.
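The "in theory it can encode relative positions" claim can be checked numerically. The sketch below is my own illustration, and `shift_matrix` is a hypothetical helper name. It builds one block-diagonal rotation matrix for a fixed offset and verifies that the same matrix maps the encoding at any position to the encoding at that position plus the offset.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    positions = np.arange(num_positions)[:, None]
    pair_index = np.arange(d_model // 2)[None, :]
    angles = positions / (10000.0 ** (2 * pair_index / d_model))
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

def shift_matrix(offset, d_model):
    """Block-diagonal matrix M such that PE[pos + offset] == M @ PE[pos].

    Each (sin, cos) pair is rotated by offset * frequency, which follows
    from the angle-addition identities. M depends only on the offset,
    never on pos.
    """
    matrix = np.zeros((d_model, d_model))
    for i in range(d_model // 2):
        angle = offset / (10000.0 ** (2 * i / d_model))
        c, s = np.cos(angle), np.sin(angle)
        matrix[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return matrix

d_model, offset = 8, 5
pe = sinusoidal_encoding(50, d_model)
shift = shift_matrix(offset, d_model)

# One fixed matrix works for every starting position.
shift_works = all(
    np.allclose(shift @ pe[pos], pe[pos + offset]) for pos in range(50 - offset)
)
print("PE[pos + 5] is a fixed linear map of PE[pos]:", shift_works)
```

A learnable position table offers no such guarantee by construction; it has to discover any relative structure from data.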
So what's the point of adding it at all?
Later it clicked: Self-Attention can attend to any position, but it needs an initial compass. Without positional encoding, the model can't even figure out which token comes first. Positional encoding isn't about preserving order in the final representation; it's about guiding attention to develop order-dependent patterns in the early layers. Even though the signal blurs in later layers, gradients still flow back through those early layers, so the weights learn to be order-sensitive. It's like training a puppy with a hand gesture first: later the gesture fades, but the conditioned reflex is already there.
In a nutshell: It's not there to "preserve" order; it's there to "kickstart" it.
---
BERT, GPT, Transformer: Split the Architecture in Half, and Each Has Its Own Achilles' Heel
Interviewers love to ask: "Why does BERT only use the Encoder, and GPT only the Decoder? Could you swap them?"
The first time I was asked, without thinking I answered: "BERT is for understanding, GPT is for generation." Then the interviewer followed up: "If you put a language model head on BERT's Encoder, could it generate text?" I was completely stuck.
Later, I drew out all three architectures on paper and studied them over and over. It comes down to one idea: field of view.
| Model | Architecture | Attention Visibility | Pre-training Task | Typical Use Case |
|---|---|---|---|---|
| Transformer | Encoder-Decoder | Self + Masked Self + Cross | Translation (conditional generation) | Machine translation, summarization |
| BERT | Encoder only | Bidirectional, full visibility | MLM + NSP | Classification, NER, QA |
| GPT | Decoder only | Left-to-right, unidirectional | Next token prediction | Dialogue, creative writing |
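The "field of view" difference in the table comes down to the attention mask. Here is a rough NumPy sketch of the two self-attention masks; the function names are my own, and cross-attention is left out. With uniform scores, a BERT-style encoder spreads attention over every token, while a GPT-style decoder can only look at itself and what came before.

```python
import numpy as np

def bidirectional_mask(seq_len):
    """Encoder-style mask: every token may attend to every token."""
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len):
    """Decoder-style mask: token i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores, mask):
    """Softmax over each row, with masked-out positions forced to weight 0."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # uniform scores, so only the mask matters

encoder_weights = masked_attention_weights(scores, bidirectional_mask(seq_len))
decoder_weights = masked_attention_weights(scores, causal_mask(seq_len))

print("encoder (bidirectional):")
print(encoder_weights)
print("decoder (causal):")
print(decoder_weights)
```

In the causal case the first token can only attend to itself, and everything above the diagonal is exactly zero, which is what "left-to-right, unidirectional" means in practice.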
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.