What to Look Forward to in Deep-Learning-Based NLP in 2016

Generated: 2026-06-20 15:13:25

---

In 2016, I Made Three Bets—One Nearly Wiped Me Out

Imagine you're sitting in a conference hall in 2016.

Up on stage, a bald-headed big shot is giving a talk on deep learning. Down in the audience, every engineer's eyes are sparkling. The guy next to you is hugging his laptop; on the screen, an LSTM training curve just finished—48 straight hours, and the loss dropped by a tiny fraction. He's so happy he nearly kisses the monitor.

That year, everyone was running on pure adrenaline.

Why?

The 2015 "Tsunami" Had Already Arrived

In 2015, ACL was held in Beijing for the first time. I squeezed into the crowd to hear Christopher Manning speak. He needed just one word to describe the impact of deep learning on NLP: "tsunami."

At the conference, people were passing around papers on the surface, but behind the scenes everyone was asking, "How did you tune your LSTM?" "Which word2vec model are you using?" Back then, traditional tasks like segmentation, POS tagging, and named entity recognition were being challenged one by one by CNNs and RNNs. By the end of the year, word2vec had become standard, seq2seq was reaching new heights in machine translation, and the attention mechanism was beginning to pop up in a few papers.

I wrote in my lab notebook: "2015 proved that deep learning can do a lot. 2016 is the real battlefield."

Guess what? I got the first part right. The ending? Completely unexpected.

---

The Three Bets I Made Back Then—Looking Back, One Nearly Wiped Me Out

Bet One: Custom Applications—Don't Give Me a One-Size-Fits-All Cure

In early 2016, AlphaGo beat Lee Sedol, and the whole field exploded.

Everyone was frantically slapping proven deep learning models onto NLP tasks: Sentiment analysis? Just grab that CNN sentence classification approach. Swap out the dataset and publish a paper. At NAACL that year, there was a paper using neural networks to detect character relationships in novels—very specific task, simple model, but the results were shockingly good.

I did it too.

Back then, I was working on classifying petition letters. I just threw a single-layer CNN at it. The accuracy was 3 percentage points higher than an SVM. I was ready to pop the champagne. But the moment I switched to a different domain's data, it dropped by 10 points.

Ten points, my friend! It's like pulling out a skeleton key only to find the lock is custom-made—it doesn't even fit.

Later I realized that true "customization" isn't just tweaking the output layer. You have to design the input representation and do data augmentation based on the task's characteristics. For example, sentence pair matching—directly concatenating two sentences and feeding them into an LSTM performed worse than the old-school TF-IDF plus cosine similarity.
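For reference, the old-school baseline mentioned above fits in a few lines. This is a minimal sketch, not the code I actually ran back then: the tokenization (whitespace splitting), the smoothed IDF formula, and the example questions are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant) so common terms keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (count / len(doc)) * idf[t] for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sentence pairs, whitespace-tokenized.
questions = [
    "how do i reset my password".split(),
    "what is the way to reset a password".split(),
    "where can i buy concert tickets".split(),
]
vecs = tfidf_vectors(questions)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(sim_related, sim_unrelated)
```

The point of keeping a baseline like this around is that it costs nothing to run, so any neural matcher has to beat it before it earns its training time.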

Crash site: I got lazy and averaged word vectors by position before feeding them into an LSTM. On a question-answering matching task, it got utterly destroyed by traditional methods. I had to go back and add an attention mechanism just to close the gap.

Bet Two: Latent Variables—The Hidden Thread Beneath the Surface

Back then, many NLP tasks liked using CRFs for sequence labeling because they explicitly model dependencies between labels. Neural networks couldn't do that—they were black boxes.

Around late 2015, people started trying to introduce latent variables into neural networks—like a subplot in a script that you never directly mention but can always feel. Variational autoencoders (VAEs) slowly entered the scene.

At ICML 2016, the paper "Neural Variational Inference for Text Processing" brought variational methods into text processing. Around the same time, another paper on variational autoencoders also grabbed attention.

I ran the experiments myself. I used latent variables for semi-supervised text classification, and with lots of unlabeled data they did boost accuracy by a few points. But training was extremely unstable: the KL term would collapse all the time, and a collapse meant a whole week wasted.
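To make "collapse" concrete, here is a small sketch of the KL term in a VAE objective. It assumes the common setup of a diagonal Gaussian posterior and a standard normal prior, which may not match every model from that period; the numbers are made up for illustration.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

# An informative posterior: the encoder has moved away from the prior.
kl_active = gaussian_kl(mu=[1.2, -0.7], log_var=[-1.0, -0.5])

# A collapsed posterior: the encoder outputs the prior itself, so the latent
# code carries no information about the input text.
kl_collapsed = gaussian_kl(mu=[0.0, 0.0], log_var=[0.0, 0.0])

print(kl_active, kl_collapsed)
```

When the logged KL term sits at essentially zero for every batch, the decoder is ignoring the latent variable, and whatever accuracy gain you hoped to get from it is gone.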

I debugged until Sunday night and finally realized the problem was the word vector initialization. Switching to pre-trained GloVe made it work.

What happened later? Latent variables really took off in text generation and controllable generation. But that thread was already planted back in 2016.

Bet Three: The Attention Mechanism—Not Just a Paper Growth Engine

In 2015, Bahdanau's attention paper had just been presented at ICLR. After reading it, my first thought: Can this solve the long sequence problem?

But you know what it felt like running those experiments?

I hand-wrote a batch-major attention mechanism using Theano 0.8. Keras 1.0 had LSTM already, but no built-in attention layer. The summaries I generated were worse than just taking the first three sentences.

Three sentences. It felt like a slap in the face.

What went wrong? The bidirectional RNN alignment back then was too heavy-handed: it often pushed the subject and predicate into different sentences. Later I added dropout to the decoder (keep probability 0.7, so 30% of activations dropped), forcing it not to fixate on one position, and only then did it become decent.
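For readers who never had to hand-write one, this is roughly what an additive (Bahdanau-style) attention step computes. It is a plain-Python sketch for a single decoder step, not my original Theano code: the function names, the tiny weight matrices, and the toy encoder states are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def additive_attention(query, keys, W_q, W_k, v):
    """score_i = v . tanh(W_q query + W_k key_i); context = sum_i weight_i * key_i."""
    q_proj = matvec(W_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(W_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context

# Toy values: three encoder states and one decoder state, all two-dimensional.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, -1.0]

weights, context = additive_attention(decoder_state, encoder_states, W_q, W_k, v)
print(weights, context)
```

The weights are what you inspect when the model fixates on one position: if a single entry stays near 1.0 across decoding steps, the alignment has gone wrong.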

But attention did significantly improve long-text processing. By the second half of the year, if a paper at ACL or EMNLP didn't have some attention mechanism, it was practically unpublishable.

---

The Pitfalls Nobody Took Seriously Back Then—Now They're All "I Wish I'd Known"

Looking back, we were way too optimistic in 2016. Some problems were right in front of us, but enthusiasm buried them.

Can Embeddings Really Represent Everything?

You might not believe it, but while working on topic detection, I found that word vectors often carried a lot of irrelevant semantics.

Take "apple"—in food corpora it's close to fruit; in tech corpora it's close to phones. Mix the two domains, and the embedding sits in the middle, not quite aligning with either.

Even worse was the out-of-vocabulary problem. Many specialized terms had no pre-trained vectors. Using the UNK token lost a ton of information. I tried character-level bigram embeddings to replace word vectors, with a CNN on top at the character level. Performance improved, but computation doubled.
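Here is a rough sketch of the character-bigram idea. Note the simplification: it averages bigram vectors instead of running a CNN over them, which is cheaper than what I describe above and is swapped in only to keep the example short. The boundary markers, the random initialization, and the example words are assumptions.

```python
import random

def char_bigrams(word):
    """Character bigrams with boundary markers: 'cat' -> ['<c', 'ca', 'at', 't>']."""
    padded = "<" + word + ">"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def build_bigram_table(words, dim, seed=0):
    """Assign a small random vector to every bigram seen in the training words."""
    rng = random.Random(seed)
    table = {}
    for word in words:
        for bigram in char_bigrams(word):
            if bigram not in table:
                table[bigram] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

def embed_word(word, table, dim):
    """Average the vectors of the word's known bigrams; zeros if none are known."""
    vectors = [table[bg] for bg in char_bigrams(word) if bg in table]
    if not vectors:
        return [0.0] * dim
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

# Hypothetical vocabulary; "tokenize" itself never appears in it.
table = build_bigram_table(["tokenizer", "lemmatize"], dim=4)
oov_vector = embed_word("tokenize", table, dim=4)
print(oov_vector)
```

In a real model the table would be trained with the rest of the network rather than left random; the sketch only shows how an unseen word still gets a vector built from pieces the model has seen.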

Then there were numbers and dates like "2016" and "2017." Embeddings couldn't handle them at all: each one was just another token, and it was nearly impossible to get the model to learn their temporal order.

Later I went back to Yoon Kim's 2014 EMNLP paper on sentence classification, which showed that a simple CNN with a single convolutional layer already performs remarkably well. I replicated it and tried stacking a second layer. His GitHub was a genuine blessing: clean code, clear comments. Results? The F1 difference between the one-layer CNN and the two-layer CNN was less than 0.5 percentage points. And even that wasn't stable; sometimes one layer worked better.

The Seemingly Almighty Seq2Seq

In 2016, many treated seq2seq as a universal tool.

But it had two fatal flaws: handling unknown words, and generating long text that tends to repeat itself. I wrote a classical Chinese poem generator (just for fun) and found that without attention, the five-character verses kept recycling the same lines like "bright moon shines on tall buildings." Adding attention gave me variations like "tall buildings gaze at bright moon"—but still lacked thematic coherence.

Back then, where were the large pre-trained models? You had to feed your own data and tune beam search. I tried beam sizes from 1 to 10, and found that around 5 was best—anything bigger started repeating sentences. These parameters seem basic now, but back then I extracted them one by one by running hundreds of jobs.
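If you have never tuned beam size by hand, this is the knob in question. The sketch below is a generic beam search over a toy next-token table, not my poem generator: the table, the token names, and the function signatures are all invented for illustration, and it does not reproduce the repetition problem I saw with large beams.

```python
import math

def beam_search(next_log_probs, start, end, beam_size, max_len):
    """Return (sequence, log_prob) pairs, best first.

    next_log_probs(seq) must return a dict {token: log_prob} for the next token.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, log_prob in next_log_probs(seq).items():
                candidates.append((seq + [token], score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return sorted(finished, key=lambda c: c[1], reverse=True)

# A toy first-order "model": next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"bright": 0.6, "tall": 0.4},
    "bright": {"moon": 0.5, "star": 0.5},
    "tall": {"tower": 0.9, "</s>": 0.1},
    "moon": {"</s>": 1.0},
    "star": {"</s>": 1.0},
    "tower": {"</s>": 1.0},
}

def toy_next_log_probs(seq):
    return {tok: math.log(p) for tok, p in TOY_MODEL[seq[-1]].items()}

greedy = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=1, max_len=5)
wider = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=3, max_len=5)
print(greedy[0])
print(wider[0])
```

In this toy table, a beam of 1 commits to "bright" because it looks best at the first step, while a beam of 3 keeps "tall" alive long enough to find the sequence with the higher overall probability. That is the upside of widening the beam; the downside I hit in practice only shows up with a real model.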

---

For the Hardcore Readers—Three Must-Read Papers

If you want to relive that 2016 turning point, start with these three:

"Neural Variational Inference for Text Processing" (ICML 2016) — See how VAEs entered NLP. Later controllable generation and diffusion models all trace back to this.

**Yoon Kim's "Convolution

Cael Lee
