NLP Pretrained Models (2021 Edition)

Generated: 2026-06-20 09:16:59

In 2021, NLP Pretrained Models Were on Fire, But Engineers Were in Tears

Believe it or not—last year I had dinner with a Baidu NLP engineer. He was buried in model training headaches, and while we were talking, he nearly spilled his coffee all over the keyboard.

He sighed, "It takes a whole month just to complete one round of iteration. GPU memory keeps blowing up. The old LSTM model in production has already been fed tens of millions of samples. The new model runs for half a day with hardly any improvement in sight, but the costs have already multiplied."

You see, back in 2018 when BERT first came out, the entire NLP community went wild. That "pretrain + fine-tune" paradigm was like a shot of adrenaline for natural language processing. By 2021, the tone had shifted: people stopped asking "can we use it?" and started asking "how can we make it actually work well?" From a technology explosion to engineering growing pains, that year saw models getting bigger and stronger—but the tears on the ground multiplied too.

Let’s walk through what really happened in 2021.

Bigger Is Better, Right? Then It Got Schooled by an LSTM

Think about it: Baidu's ERNIE series introduced "knowledge masking" in 2019, and by 2021 it had already iterated to version 3.0. The family was pitched as covering NLP, cross-modal, and speech tasks. Sounds versatile, doesn't it? How dominant!

But when it came to actual deployment? Problems piled up one after another in dialogue systems: training was excruciatingly slow. Business demanded biweekly iterations, but the model took a month per cycle. GPU memory exploded at every turn. Inference latency was ridiculous. And the most painful part: the LSTM still running in production, trained on tens of millions of samples, was already rock solid. After all the fuss over this shiny new pretrained model, the improvement was almost invisible.
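Why does GPU memory "explode" so easily? A back-of-the-envelope estimate helps. The sketch below assumes plain fp32 training with the Adam optimizer, which costs about 16 bytes per parameter (4 for weights, 4 for gradients, 8 for Adam's two moment buffers). The engineer never said which setup he used, so the function name and the two model sizes are purely illustrative, and activations (often the larger share) are left out.

```python
# Rough lower bound on GPU memory needed to train with Adam in fp32.
# Assumption: 4 bytes for weights + 4 for gradients + 8 for Adam's two
# moment buffers = 16 bytes per parameter. Activations are NOT included.

BYTES_PER_PARAM_FP32_ADAM = 4 + 4 + 8


def training_state_gib(num_params: int) -> float:
    """Return weights + gradients + optimizer state, in GiB."""
    return num_params * BYTES_PER_PARAM_FP32_ADAM / 2**30


if __name__ == "__main__":
    # Illustrative sizes only: a 110M-parameter model and a 10B-parameter one.
    for name, n in [("110M params", 110_000_000), ("10B params", 10_000_000_000)]:
        print(f"{name}: {training_state_gib(n):.1f} GiB before activations")
```

Under these assumptions the smaller model needs under 2 GiB of training state, while the 10B-parameter one needs roughly 149 GiB, more than a single GPU holds, before a single activation is stored.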

In other words, it crushed benchmarks on paper, but under the constraints of engineering, it was a big, useless beast—nice to look at, but not fit for purpose.

Did you think BERT was the finish line? Wrong! The real finish line is throwing your model into a real business, where it has to handle concurrent traffic, afford the compute, and run fast. In 2021, that contrast was painfully stark.

One Brain, 96 Languages! But Then Came the Problems

2021 was the year of multilingual models. Look at the lineup:

Baidu released ERNIE-M, jointly learning 96 languages—one model that understands 96 languages at once, setting new state-of-the-art scores on five types of cross-lingual understanding tasks.

Google wasn't idle either, releasing mT5, the multilingual version of T5, pretrained on mC4, a Common Crawl-based corpus covering 101 languages. They even added a special "accidental translation prevention" mechanism to keep the model from suddenly blurting out other languages in non-translation tasks.

One took the knowledge-enhancement path, the other the scaling-up path. Two different routes, but the same goal: stop serving only English users and use one set of parameters to serve the world.

You might think, "Wow, amazing!" But hold on—once models grow bigger and more complex, the engineering pitfalls come along for the ride. We’ll get to that.

General? Better to Specialize! Millions of Dialogue Utterances Beat Billions in General Data

Starting around 2020, people working on task‑oriented dialogue noticed something counterintuitive: using a general pretrained model for a dialogue system might actually be worse than training from scratch on dialogue data alone.

In 2021, this insight became more systematic. Research pointed out that BERT shines on open‑domain text, but dialogue has interactive structure, system actions, user intents—things that general pretraining never covers. The comparison was jaw‑dropping: in certain scenarios, a model pretrained on millions of dialogue utterances outperformed—after fine‑tuning—a general model that was ten times larger!

What does that tell us? Pretraining can’t just chase “bigger”; it has to become “more specialized.” You wouldn’t put a race‑car engine in a delivery truck and expect it to go fast.

Fine-Tuning Got a New Trick: Not Just Fine-Tuning, but Span Fine-Tuning

No matter how valuable pretraining is, it still needs fine-tuning to release its power. At EMNLP 2021, a paper called Span Fine-tuning for Pre-trained Language Models proposed span-level fine-tuning. What does that mean? Standard fine-tuning works with token-level or sentence-level representations and skips the units in between. Span fine-tuning segments the input into spans (whole words, entities, phrases) suited to the downstream task and adds span-level representations on top of the pretrained model's output during fine-tuning. The paper reports consistent improvements.
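To make "span-level" concrete, here is a toy sketch of the general idea: enrich each token vector with a summary of the span it belongs to. It is not the paper's architecture. Plain mean pooling stands in for the paper's span encoder, and the function names, the 2-d vectors, and the segmentation are all made up for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def add_span_features(
    token_vecs: List[Vector], spans: List[Tuple[int, int]]
) -> List[Vector]:
    """Concatenate each token vector with the mean vector of its span.

    `spans` are half-open (start, end) index pairs that are assumed to
    cover every token exactly once, in order.
    """
    out: List[Vector] = []
    for start, end in spans:
        span_vec = mean_pool(token_vecs[start:end])
        for tok in token_vecs[start:end]:
            out.append(tok + span_vec)  # list concatenation, not addition
    return out


if __name__ == "__main__":
    # Toy 2-d "encoder outputs" for the tokens: New, York, is, big
    tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 1.0], [2.0, 2.0]]
    # Hypothetical segmentation: "New York" is one span, the rest are singletons.
    spans = [(0, 2), (2, 3), (3, 4)]
    for row in add_span_features(tokens, spans):
        print(row)
```

In this toy run, "New" and "York" end up sharing the same span half of their vectors, which is the kind of signal a downstream classifier would otherwise have to work out on its own.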

This is especially friendly to tasks like question answering and named entity recognition—it turns the whole “pretrain–fine‑tune” process from a one‑size‑fits‑all banquet into a custom‑tailored meal.

Applause for the Technology, Tears for Engineering

Papers kept coming out, models kept getting stronger. But what was the real voice from engineers on the front line?

The Baidu engineer’s words from before cut straight to three core conflicts:

Computing cost vs. rapid iteration: Models are big and slow to train. At one version per month, the business can only sit and wait.
Model complexity vs. inference latency: Parameter counts go up, but running the model becomes as choppy as a slideshow.
General pretraining vs. production baseline: The old LSTM has been stable online for ages, and the new model offers barely any marginal benefit over it.
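These three conflicts can be written down as a simple pre-launch check. The sketch below is one hypothetical way to do it, not any team's actual process. The class, the function, and every number are invented for illustration, except the biweekly (14-day) iteration budget and the month-long training cycle, which come from the engineer's story.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    metric: float          # offline quality metric, higher is better
    p99_latency_ms: float  # measured inference latency
    train_days: float      # wall-clock time for one training iteration


def failed_checks(
    new: Candidate,
    baseline: Candidate,
    min_gain: float,
    latency_budget_ms: float,
    iteration_budget_days: float,
) -> List[str]:
    """Return the checks the new model fails; an empty list means it can ship."""
    failures = []
    if new.train_days > iteration_budget_days:
        failures.append("iteration too slow")
    if new.p99_latency_ms > latency_budget_ms:
        failures.append("latency over budget")
    if new.metric - baseline.metric < min_gain:
        failures.append("gain over baseline too small")
    return failures


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to mirror the story above.
    lstm = Candidate("production LSTM", metric=0.900, p99_latency_ms=15, train_days=2)
    big = Candidate("pretrained model", metric=0.905, p99_latency_ms=120, train_days=30)
    print(failed_checks(big, lstm, min_gain=0.01,
                        latency_budget_ms=50, iteration_budget_days=14))
```

With these made-up inputs the new model fails all three checks, which is the engineer's complaint in a nutshell. The point of an explicit gate is that leaderboard gains are only one of its three lines.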

In 2021, pretrained models had ballooning parameters, expanding language coverage, and increasing task diversity. But what about the “soft” engineering capabilities—model compression, inference acceleration, data efficiency? Did they mature in sync? Not at all.

A good model isn’t just a single point on a leaderboard. As the saying goes, “You can go ahead and dominate every benchmark; whether the business actually uses you is another story.”

Final Thoughts: Bigger Isn't Always Better; Smarter Is What's Worth More

From Word2Vec in 2013, to BERT in 2018, to the blossoming of a hundred flowers in 2021—pretrained models have come a long way. But there’s still a long road ahead to get to “works well, easy to use, cheap to run.”

The future direction isn’t to keep piling on parameters, but to make models lighter, faster, and more business‑aware. After all, a truly good model is one that lets engineers sleep soundly at night—not one that explodes their GPU at 2 a.m.

If you can’t get the business side to use it with a smile, then all the SOTA scores in the world are just your own lonely self‑admiration.

Cael Lee
