LLM Inference Acceleration: A Hardware Perspective

Generated: 2026-06-20 20:09:22

---

Let me tell you a true story.

I've been writing about AI hardware for ten years now.

I've watched it all, from the BERT days when people scrambled for V100s like they were Spring Festival train tickets, to today's gold rush for H100s, with everyone hoarding GPUs like crazy. Scroll through any tech group, and eight out of ten questions are: "Where can I buy an H100? How much?"

But so what if you buy one?

Here's the brutal truth—the H100 you blew tens of thousands on? Most of the time, it's just putting on a show. The fans whir, but a big chunk of your memory channels sit idle, and less than half your compute lanes are actually working.

And you know that stings.

So today, let me break it down for you. Drawing from a decade of my own bloody lessons—stumbling, burning out machines, staying up all night tuning parameters—let's figure out what hardware really deserves your attention.

---

Here's my core take, right up front—

If you're dead set on GPUs for large model inference acceleration, you're heading straight into a dead end.

The smart move? Learn the temperament of each chip and divvy up the work.

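Think about it: a 7B model needs about 14 GB for its weights alone at half precision (7 billion parameters × 2 bytes each). Now try a 70B model with a long context on a single card. No way: the weights alone come to about 140 GB, before you store a single token of context. And at that point, piling on more GPU compute doesn't help. Why? Because the bottleneck has quietly shifted from compute to memory bandwidth and capacity.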
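
Here's that arithmetic as a quick sketch, so you can plug in your own model. The 70B-class shape below (80 layers, 8 KV heads, head dimension 128), the 32k context, and the batch of 8 are my assumptions for illustration, not measurements from any run described here.

```python
def weight_bytes(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone (fp16/bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size: one K and one V vector per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


GB = 1e9

print(f"7B weights, fp16:  {weight_bytes(7e9) / GB:.0f} GB")
print(f"70B weights, fp16: {weight_bytes(70e9) / GB:.0f} GB")

# Assumed 70B-class shape (hypothetical, for illustration only).
kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                    seq_len=32_768, batch=8)
print(f"KV cache, 32k context, batch 8: {kv / GB:.1f} GB")
```

Set those numbers next to the 80 GB on an H100 and the capacity problem is obvious: under these assumptions the cache alone is in the same league as a whole card, and it grows linearly with both context length and batch size.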

Last year I ran a 70B model across 8 H100s (a single card really couldn't fit it). I increased batch size from 4 to 8—guess what? Decode tokens per second got cut in half.

Why?

The KV cache ate up all the HBM bandwidth. Optimizations like FlashAttention? They help, but they don't fix the root problem.

Speed gets strangled to death, plain and simple.
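
If you want to see why bandwidth is the choke point, here's a back-of-the-envelope model. It assumes each decode step has to stream all the weights once (shared by the whole batch) plus every sequence's KV cache, and that memory traffic is the only cost. Every number in it is a hypothetical input: 140 GB of weights, about 10.7 GB of KV cache per sequence, and 8 devices at 3.35 TB/s each. It ignores compute, interconnect, and kernel overheads, so treat it as a lower bound on step time, not a reproduction of my run.

```python
def decode_step_seconds(weight_bytes, kv_bytes_per_seq, batch, bandwidth_bytes_per_s):
    """Lower bound on one decode step if memory traffic is the only cost."""
    bytes_moved = weight_bytes + batch * kv_bytes_per_seq
    return bytes_moved / bandwidth_bytes_per_s


def per_sequence_tokens_per_s(weight_bytes, kv_bytes_per_seq, batch, bandwidth_bytes_per_s):
    """Each sequence receives one new token per decode step."""
    step = decode_step_seconds(weight_bytes, kv_bytes_per_seq, batch, bandwidth_bytes_per_s)
    return 1.0 / step


# Hypothetical inputs, chosen only to illustrate the trend.
WEIGHTS = 140e9            # fp16 weights of a 70B-class model
KV_PER_SEQ = 10.7e9        # KV cache of one long-context sequence
BANDWIDTH = 8 * 3.35e12    # assumed aggregate HBM bandwidth of 8 devices

for batch in (4, 8, 16, 32):
    rate = per_sequence_tokens_per_s(WEIGHTS, KV_PER_SEQ, batch, BANDWIDTH)
    print(f"batch {batch:>2}: {rate:6.1f} tokens/s per sequence, "
          f"{rate * batch:7.1f} tokens/s total")
```

In this model, weight traffic is amortized across the batch but KV traffic is not. Doubling the batch halves per-sequence speed only in the limit where KV traffic dwarfs weight traffic, which is exactly the regime long contexts push you toward.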

---

And that's where it gets interesting.

A lot of people don't know this, but FPGAs have been quietly creeping up over the last few years.

The TeLLMe paper describes a ternary LLM running on an AMD KV260 edge platform at just 7 W of power. And you know how fast it went? 9 tokens/s!

I got chills when I saw that number.

I haven't flashed this design onto a board myself, but I dug up their paper. They implemented both prefill and autoregressive decoding directly in hardware, with 1.58-bit weights and 8-bit activations.

No instruction decode overhead. No cache pollution. Deterministic latency—one second is one second.
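
Where does "1.58-bit" come from? A ternary weight takes one of three values, and log2(3) is about 1.58. Below is a minimal sketch of one common scheme: absmean scaling for the weights and absmax scaling for the 8-bit activations. I'm assuming this scheme for illustration; it is not taken from the TeLLMe paper, and the function names are my own.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization: map each weight to -1, 0, or +1."""
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return q, scale


def int8_quantize(activations, eps=1e-8):
    """Absmax 8-bit quantization: map activations to integers in [-127, 127]."""
    scale = max(abs(a) for a in activations) / 127.0
    q = [max(-127, min(127, round(a / (scale + eps)))) for a in activations]
    return q, scale


def ternary_dot(q_w, w_scale, q_a, a_scale):
    """Dot product with integer adds and subtracts only, rescaled once at the end."""
    acc = 0
    for w, a in zip(q_w, q_a):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
    return acc * w_scale * a_scale


print(f"bits per ternary weight: {math.log2(3):.2f}")
q_w, w_scale = ternary_quantize([0.9, -0.05, -1.2, 0.4])
q_a, a_scale = int8_quantize([1.0, -2.0, 0.5, 0.0])
print(q_w, q_a, ternary_dot(q_w, w_scale, q_a, a_scale))
```

Notice that the inner loop has no multiplications at all. That is the property that makes ternary models such a natural fit for FPGA fabric.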

And TerEffic? Even more impressive. They ran Llama-7B on an Alveo U280, hitting 290 tokens/s at just 46 W.

Do you know how many watts a GPU needs for that kind of performance?

At least 300W.

For edge devices and IoT scenarios—that's a knockout blow.
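
To put those figures on one scale, divide throughput by power. Tokens per second per watt is the same thing as tokens per joule. The inputs below are just the numbers quoted above, with the 300 W GPU row using my "at least" figure as a floor. The three rows involve different models and sizes, so read this as a rough sense of scale, not a controlled benchmark.

```python
def tokens_per_joule(tokens_per_s: float, watts: float) -> float:
    """Energy efficiency: (tokens/s) / (J/s) = tokens per joule."""
    return tokens_per_s / watts


# (throughput in tokens/s, power in watts), as quoted in the text above.
figures = {
    "TeLLMe on KV260": (9.0, 7.0),
    "TerEffic on Alveo U280": (290.0, 46.0),
    "GPU at the same throughput, 300 W": (290.0, 300.0),
}

for name, (tps, watts) in figures.items():
    print(f"{name:<36} {tokens_per_joule(tps, watts):5.2f} tokens/J")
```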

---

Of course, some people will roll their eyes.

"Come on, you're a columnist, don't sell me snake oil. FPGA development is brutally hard. Why wouldn't I just stick with a GPU?"

I'll admit, you're right. Early on, I fell into the HLS trap—spent two months optimizing a matrix multiply, only to end up slower than a CPU. The frustration? I wanted to throw my keyboard across the room.

But that was then.

Now AWQ quantization and lightweight toolchains have matured. A real example: a quantized Qwen2.5 0.5B model running on a KV260 at 5.1 tokens/s. Not fast, but at 6.5 W it's already incredible.

The key is—FPGAs are reconfigurable. Model changes? No need to swap hardware. Just reconfigure.

You get both flexibility and efficiency.

---

But don't let me sell you on FPGAs as a silver bullet, either.

Another heavily underestimated player is the CPU.

Last year, when I wrote "CPU-GPU Co-Acceleration for Large Model Inference," I tested Intel's 4th Gen Xeon (Sapphire Rapids) with AMX. Going in, my assumption was simple: CPUs are slow, and they're only good as memory warehouses.

But those AMX matrix acceleration units—they really mean business.

The IEEE paper I referenced sorted OPT-30B's layers into two types by compute intensity and memory demand: high-memory, low-compute layers go to the CPU, and compute-heavy layers stay on the GPU.

The result? PCIe data transfer dropped dramatically. Single-inference latency improved by 12.1x, and throughput increased by 5.4x.

Annoying, right?

I had to reproduce it myself.

At first, I went quick and dirty—offloaded everything to the CPU. It became so slow I thought I'd written a trojan.

Then I analyzed it: attention layers are best on the CPU—they need tons of memory access but simple computation. MLP matrix multiplies? Keep them on the GPU. Combined with DeepSpeed-Inference's hybrid parallelism, a single machine with 8 A10 cards managed throughput close to 4 H100s.
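
The placement rule I ended up with can be sketched as a simple threshold on arithmetic intensity, meaning FLOPs per byte of memory traffic. This is my own simplification, not the algorithm from the IEEE paper. The `Layer` type, the threshold of 10, and the per-layer numbers are all hypothetical; in practice you would profile your own model to fill them in.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    flops: float        # floating-point operations per decode step
    bytes_moved: float  # bytes read or written per decode step


def place(layer: Layer, intensity_threshold: float) -> str:
    """Send low arithmetic-intensity (memory-bound) layers to the CPU."""
    intensity = layer.flops / layer.bytes_moved
    return "cpu" if intensity < intensity_threshold else "gpu"


# Hypothetical profile: decode-time attention mostly streams the KV cache,
# while a batched MLP matmul reuses each weight many times.
layers = [
    Layer("attention", flops=2e9, bytes_moved=1e9),
    Layer("mlp", flops=4e11, bytes_moved=2e9),
]

for layer in layers:
    print(f"{layer.name:<10} -> {place(layer, intensity_threshold=10.0)}")
```

A single threshold is crude. It ignores PCIe transfer cost and how much of the work can overlap, which is exactly what bit me when I offloaded everything at once.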

Saved a ton on cost.

My friends in the group asked if I'd pulled off some kind of magic—

It's not magic. It's knowing the hardware.

---

Let me tell you something else I only got right earlier this year: speculative decoding.

The classic pitfall? Grabbing some random 500M model as a draft for a 7B model. The acceptance rate was barely 30%, and the whole thing ran slower than plain decoding. I was baffled. Does this even work?

Then I swapped to a small model from the same family, with the same tokenizer: something like a 1.5B draft for a 13B target, proposing k=4 draft tokens per step. Decoding speed doubled.
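
Why does the acceptance rate matter so much? Here's the standard idealized estimate from the speculative decoding literature. If each draft token is accepted independently with probability `alpha`, one target-model pass yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The `draft_cost_ratio` (one draft pass relative to one target pass) is a value you would have to measure yourself; the 0.1 below is a placeholder, and so is the 0.8 acceptance rate in the second call. The model ignores framework overhead, so real numbers will come out worse.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Average tokens produced per target-model pass with k draft tokens."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Idealized speedup over plain decoding, ignoring all overheads."""
    cost_per_step = k * draft_cost_ratio + 1  # k draft passes + 1 target pass
    return expected_tokens_per_step(alpha, k) / cost_per_step


# Placeholder inputs: k=4, draft pass costing 10% of a target pass.
for alpha in (0.3, 0.8):
    print(f"alpha={alpha}: "
          f"{expected_tokens_per_step(alpha, 4):.2f} tokens/step, "
          f"{expected_speedup(alpha, 4, draft_cost_ratio=0.1):.2f}x idealized")
```

At a 30% acceptance rate, even this optimistic model lands close to break-even, so any real-world overhead tips it into a net loss. That lines up with what I saw with the mismatched draft model.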

What does this tell you?

Acceleration isn't about piling on parameters. It's about figuring out where the bottleneck in your compute graph really is. Memory bandwidth? Latency? Utilization?

Different scenarios, completely different solutions. Don't kid yourself into thinking one approach fits all.

---

Someone might push back: "You've talked a lot, but FPGAs aren't widely used, and CPUs are obviously slower than GPUs. Isn't this all just theory?"

I admit it: generality is the FPGA's weakness. With an H100, any quantization tool gets you running in half an hour. With an FPGA, you're writing RTL or HLS, dealing with different boards, spending a whole day just


Cael Lee
