Mixture of Experts (MoE) Study Notes (English)

Generated: 2026-06-20 18:10:50

---

Three months ago, I was hammering away at my keyboard with one question stuck on the screen: how exactly does MoE save compute?

At first, I thought it was dead simple: just split a big FFN into several small FFNs, done. But training crashed immediately, and the loss took off like a kite with a broken string. I pulled three all-nighters wrestling with papers, had my eyes opened by Su Shen’s geometric explanation, and ran tests on 8× A100s until my neck went stiff. Today, I’ll spill all the pitfalls I ran into. I bet you’ve hit some of them too.

Let’s start with something counterintuitive: MoE doesn’t save memory, it saves computation. DeepSeek-V3 has 671B total parameters, but each token activates only 37B. The total parameter count is roughly 18 times the active count, while per-token compute stays close to that of a 37B dense model. The core operation is this: the original “universal old expert” FFN gets split into 256 small experts, each covering a specialized direction. For every token, the Router picks only the top-8 relevant experts to work; the other 248 get to sit back. Just like in the office: when a task comes up, only the relevant people get called in, and everyone else keeps scrolling on their phones.
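To make the routing step concrete, here is a minimal sketch of top-k routing for a single token. It is a toy under stated assumptions, not DeepSeek-V3's code: the softmax gate, the renormalization over the selected experts, the scalar "experts", and the names `route_top_k` and `moe_forward` are all my own illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Assumption: softmax gating with renormalization over the selected
    experts. Real models may use a different gate.
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]

def moe_forward(x, router_logits, experts, k):
    """Run only the selected experts; the rest are never evaluated."""
    ids, weights = route_top_k(router_logits, k)
    out = sum(w * experts[i](x) for i, w in zip(ids, weights))
    return out, ids

# Toy setup: 8 scalar "experts" (expert i multiplies by i + 1), top-2 routing.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]
out, ids = moe_forward(1.0, logits, experts, k=2)
print(ids, out)
```

Only the two selected experts are ever called, which is where the FLOPs saving comes from. All eight still have to exist in memory.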

The Router sounds simple, but it blows up as soon as you try it. It has to compute an alignment score for each expert and pick the top k. The problem is that the Router’s logits have insanely high variance, so the routing probabilities often collapse into one-hot. Top-k then becomes a hard selection, and the gradients to every other expert get cut off. You stare as the loss flies away, clueless about the cause. Eventually, I dug into the DeepSeek source code and found they added a one-line z-loss constraint, roughly z_loss = mean(logsumexp(logits) ** 2), the standard router z-loss that penalizes large logits. After adding it, the loss became stable as an old dog. The first time I read the paper, I didn’t even notice that line. Only after stepping on the landmine myself did it burn into my bones.
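The z-loss idea can be sketched in its commonly published form (the router z-loss from the ST-MoE paper): the mean over tokens of the squared log-sum-exp of the router logits. This is that textbook form, not necessarily the exact line from any particular codebase, and the function names and the coefficient are hypothetical.

```python
import math

def logsumexp(xs):
    # Stable log(sum(exp(x))) for one token's router logits.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def router_z_loss(batch_logits):
    """Mean over tokens of logsumexp(logits)^2.

    Large-magnitude logits give a large log-sum-exp, so minimizing this
    term pulls the logits back toward a small, well-behaved range.
    """
    return sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)

Z_LOSS_COEF = 1e-3  # hypothetical weight; tune it for your own setup

calm = [[0.1, -0.2, 0.0, 0.3]]     # logits close together
spiky = [[10.0, -20.0, 0.0, 30.0]]  # one logit dominates (near one-hot)

print(router_z_loss(calm), router_z_loss(spiky))
# In training this would be added to the main loss, for example:
# total_loss = task_loss + Z_LOSS_COEF * router_z_loss(batch_logits)
```

The spiky batch is penalized far more heavily than the calm one, which is exactly the pressure that keeps the Router from collapsing into one-hot.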

So the truth about saving compute is this: you can only save FLOPs; the memory budget doesn’t get a penny off. All 671B parameters still need to live in memory, and you only multiply through about 37B of them per token. Don’t interpret that as “you can run big models even on weak GPUs”. Wake up: you won’t save a cent on what you have to spend.
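A quick back-of-the-envelope check of that claim, using the 671B and 37B figures above. The assumptions are mine: 2 bytes per parameter (bf16/fp16 weights), the rough rule of about 2 FLOPs per active parameter per token for a forward pass, and 80 GB per card. Activations, KV cache, and optimizer state are ignored.

```python
import math

TOTAL_PARAMS = 671e9    # from the text
ACTIVE_PARAMS = 37e9    # from the text
BYTES_PER_PARAM = 2     # assumption: bf16/fp16 weights
CARD_MEMORY_GB = 80     # assumption: 80 GB cards

# Memory is paid on ALL parameters.
weight_memory_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
cards_for_weights = math.ceil(weight_memory_gb / CARD_MEMORY_GB)

# Compute is paid only on ACTIVE parameters (rule of thumb: ~2 FLOPs each).
moe_flops_per_token = 2 * ACTIVE_PARAMS
dense_flops_per_token = 2 * TOTAL_PARAMS
ratio = dense_flops_per_token / moe_flops_per_token

print(f"weights alone: {weight_memory_gb:.0f} GB, "
      f"at least {cards_for_weights} cards just to hold them")
print(f"a dense model of the same size would cost ~{ratio:.1f}x "
      f"the FLOPs per token")
```

Under these assumptions the weights alone need well over a terabyte, while the per-token compute is about one eighteenth of a same-size dense model.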

And that naturally leads to the next question: When the parameter count blows past a single card, TP or EP?

My initial thought was naive: the MoE layer has N experts, so put different experts on different cards. Easy, right? That’s expert parallelism (EP). But as soon as I tried it, I realized MoE models also have Attention layers, which need tensor parallelism (TP) to split. The two tangled together, and the communication graph blew up into a hair-pulling mess.

I couldn’t resist building a small-scale prototype based on an MoE architecture similar to DeepSeek-V3, and compared two approaches on 8× A100s:

Plan A: pure TP-8—each expert is sliced across cards, and every card holds a shard of every expert.
Plan B: EP+TP hybrid (I jotted it down as AD at the time, still working out the details).

What happened? Plan A had ridiculously high communication overhead on the expert layers, since every sliced expert has to combine its partial outputs across all eight cards. Plan B hit a new wall with load balancing. The final takeaway: EP and TP aren’t an either-or choice; you have to mix them. Do TP on the Attention layers and EP on the expert layers, but the sharding strategy has to be tuned to your number of experts and number of cards. There’s no silver bullet.
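To see why expert parallelism runs into load balancing, here is a small sketch that places experts on cards in contiguous blocks and counts how many token-to-expert assignments land on each card. The block placement, the helper names, and the toy routing decisions are all assumptions for illustration, not measurements from my prototype.

```python
from collections import Counter

def expert_to_rank(expert_id, num_experts, ep_size):
    """Contiguous placement: experts [0..n/ep) on rank 0, and so on."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    return expert_id // (num_experts // ep_size)

def tokens_per_rank(routed, num_experts, ep_size):
    """Count token-to-expert assignments that each rank must process."""
    counts = Counter()
    for token_experts in routed:
        for e in token_experts:
            counts[expert_to_rank(e, num_experts, ep_size)] += 1
    return [counts.get(r, 0) for r in range(ep_size)]

# Toy case: 16 experts on 8 cards (2 per card), top-2 routing, 4 tokens.
# The Router here favors experts 0 and 1, which sit on the same card.
routed = [[0, 1], [0, 1], [0, 3], [14, 15]]
load = tokens_per_rank(routed, num_experts=16, ep_size=8)
imbalance = max(load) / (sum(load) / len(load))

print(load, imbalance)
```

In a synchronous step every card waits for the busiest one, so an imbalance like this means most of the cards sit idle while one does most of the expert work.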

As for why load balancing always goes south—from aux loss to DeepSeek’s alternative approach, and the concrete code evolution from BasicMoE to DeepSeek—those are traps I fell into even harder, and I can’t finish today. Hit like, and next time I’ll write a dedicated follow-up, spilling all the geometric intuition behind load balancing and the five versions of routing code evolution.

Finally, here’s a sentence you can screenshot and take away:

MoE isn’t magic—it’s the art of engineering. What it saves isn’t memory, it’s your perspective: from one person carrying the whole load to a team of experts sharing it. The biggest trap isn’t the technology—it’s thinking you already understand.


Cael Lee
