Informer: A Transformer-Based Take on Efficiency

Generated: 2026-06-23 03:51:23

---

Brother, don’t rush to put Informer on a pedestal just yet! I get it—AAAI 2021 Best Paper, long sequence forecasting, a double win in computational efficiency and accuracy—it gets your blood pumping, right? But if you think it can pick stocks, predict your website’s DAU, or be a one-size-fits-all solution ... then I suggest you take your hands off the keyboard for a minute and let me tell you about the pitfalls I’ve stumbled into. After running the model, you might just end up frustrated.

Here’s the deal. I had a project: predicting the compute load of a data center for the next week. The input was minute-level data from the past 20 days, 28800 points; the target? The next 7 days, 10080 points. A classic LSTF (Long Sequence Time-series Forecasting) problem, and the name alone tells you how tough it is. I’d tried LSTM, TCN, and the standard Transformer before: either the predictions fell apart over long horizons, or training crawled along, taking forever per step. I was pulling my hair out. When Informer came out, my eyes lit up: finally, a savior! Well, it turned out to be a mixed bag, so let me tell you the story today.

Why Did It Win the Award? Three Pain Points, Three Heavy Punches

How painful is the original Transformer when handling long sequences? Three big traps: First, the self-attention computation complexity is O(L²)—once L goes over a thousand, your GPU memory screams and your fans practically take off; second, stacking several encoder/decoder layers gives you memory O(J·L²), so long inputs simply don’t fit; third, during decoding, producing outputs step by step means the longer the prediction, the slower the inference—just like RNN, isn’t that annoying?
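To put a number on that first trap, here’s a back-of-the-envelope sketch of how much memory the raw L×L score matrices of one attention layer take. The batch size of 32, the 8 heads, and float32 values are my assumptions for illustration, and `attention_score_bytes` and `cost_ratio` are hypothetical helper names, not anything from the paper.

```python
import math

def attention_score_bytes(seq_len, heads=8, batch=32, bytes_per_value=4):
    """Bytes needed to hold the full L x L score matrices of one attention layer.

    heads, batch and bytes_per_value are illustrative assumptions.
    """
    return batch * heads * seq_len * seq_len * bytes_per_value

def cost_ratio(seq_len):
    """How many times larger L^2 is than L * ln(L)."""
    return (seq_len ** 2) / (seq_len * math.log(seq_len))

for L in (512, 1024, 2048, 4096):
    mib = attention_score_bytes(L) / 2**20
    print(f"L={L:5d}  scores={mib:8.0f} MiB  L^2/(L ln L)={cost_ratio(L):6.1f}")
```

Doubling L quadruples the score memory, while the gap between L² and L·ln L keeps widening, which is exactly why long inputs hurt so much.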

Informer throws a punch at each of these: ProbSparse Self-Attention cuts complexity to O(L log L), self-attention distillation halves the input length per layer, and the generative decoder outputs the entire prediction in one go. Sounds perfect, right? I thought so too. But reality ... well, let’s take them one by one.

Pitfall One: ProbSparse Self-Attention—Saves Compute, but Does It Save You Worry?

Its core idea is actually clever: most self-attention scores are tiny, and only a few dot products contribute the bulk of the attention, a long-tail distribution. So why not compute just those “important” query-key pairs and ignore the rest? Efficiency goes up. The paper scores each query with a KL-divergence-based sparsity measure (how far its attention distribution is from uniform), keeps only the top-scoring queries, and controls how many are kept to manage complexity. When I tried it on my data, damn, it was sweet: attention memory dropped from O(L²) to roughly O(L log L). With a sequence length of 1024, a standard Transformer attention layer needed 1 GB of memory; ProbSparse used only about 150 MB. That’s not just nice, it’s awesome!
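If you want to see the selection step in code, here’s a minimal sketch in plain Python. It uses the max-minus-mean approximation of the sparsity measure and keeps the top u = ceil(c·ln L) queries. To keep it short I score every query against all keys, whereas a real implementation also samples keys to stay cheap; `sparsity_scores` and `select_active_queries` are hypothetical names of mine.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparsity_scores(queries, keys):
    """Max-minus-mean of the scaled dot products, one score per query."""
    scale = math.sqrt(len(keys[0]))
    scores = []
    for q in queries:
        s = [dot(q, k) / scale for k in keys]
        scores.append(max(s) - sum(s) / len(s))
    return scores

def select_active_queries(queries, keys, c=5):
    """Indices of the top-u queries, with u = ceil(c * ln L), capped at L."""
    L = len(queries)
    u = min(L, math.ceil(c * math.log(L)))
    m = sparsity_scores(queries, keys)
    ranked = sorted(range(L), key=lambda i: m[i], reverse=True)
    return sorted(ranked[:u])

# Toy data: random queries and keys, purely for illustration.
random.seed(0)
L, d = 64, 8
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
active = select_active_queries(Q, K, c=5)
print(len(active), "of", L, "queries get full attention")
```

Only the selected queries get a real attention computation; the rest fall back to a cheap default (the mean of the values in the paper), which is where the approximation creeps in.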

But here’s the catch: picking queries depends on a hyperparameter, the sampling factor c, which controls how many queries are kept. The paper sets it to 5 by default, which means keeping roughly the top 5·ln L queries, not 5 queries. I didn’t think much of it at first and just ran it on a highly periodic dataset (power load, with a 24-hour cycle). Guess what? The smaller the sampling factor, the faster the training, but if you don’t tune it just right, the prediction error shoots up like a wild horse. Once I got lazy, didn’t tune it, and switched to a high-frequency financial dataset (transactions per second), and the MSE doubled! Doubled, man! I later switched to dynamic adjustment, adapting the sampling to the sequence length, and it stabilized.
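Why does sequence length matter so much here? A quick sketch, assuming the number of kept queries follows u = ceil(c·ln L): the same factor covers a very different share of the sequence depending on L. `active_query_count` is a hypothetical helper, and the lengths are just examples.

```python
import math

def active_query_count(seq_len, factor):
    """Number of queries kept, assuming u = ceil(factor * ln L), capped at L."""
    return min(seq_len, math.ceil(factor * math.log(seq_len)))

for L in (96, 1024, 4096):
    for c in (1, 3, 5, 10):
        u = active_query_count(L, c)
        print(f"L={L:5d}  c={c:2d}  kept={u:3d}  share={u / L:6.1%}")
```

With c = 5, a 96-step input keeps about a quarter of its queries, while a 1024-step input keeps only a few percent, so a factor that is safe on short windows can be far too aggressive on long ones.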

Here’s the counterintuitive insight: You think you’re getting free compute savings? No, it’s efficiency traded for approximation. If your time series has some very sparse but critical patterns that get missed, the model goes blind. So never, ever blindly use the default values. It’s best to sweep this parameter on the validation set, or use the paper’s “long-tail distribution” assumption to first check if your data fits—most industrial data does, but there are always exceptions. Think you’re safe just because you saved compute? Hey, don’t be naive.
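One rough way to check the long-tail assumption before trusting the defaults: take attention score rows from a small full-attention run on your own data and measure how much softmax mass the top few keys hold. This is only a sketch; the function names are mine, and the two toy matrices below stand in for real scores you would have to extract yourself.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_mass(score_row, k):
    """Share of attention probability held by the k largest keys."""
    probs = sorted(softmax(score_row), reverse=True)
    return sum(probs[:k])

def mean_top_k_mass(score_matrix, k):
    """Average top-k share over all query rows."""
    return sum(top_k_mass(r, k) for r in score_matrix) / len(score_matrix)

# Toy stand-ins: one sharply peaked score matrix, one completely flat.
n = 16
peaked = [[8.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
flat = [[0.0] * n for _ in range(n)]

print(f"peaked: top-2 keys hold {mean_top_k_mass(peaked, 2):.1%}")
print(f"flat:   top-2 keys hold {mean_top_k_mass(flat, 2):.1%}")
```

If your real scores look like the peaked case, sparse selection is probably safe; the closer they get to the flat case, the more the approximation is going to cost you.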

Pitfall Two: Self-Attention Distillation—Squeeze Too Hard, and You Lose Information

Informer’s encoder stacks several attention layers. With distillation, after each layer, they use MaxPooling to halve the sequence length. With J layers stacked, the sequence length becomes L/2^J, and space complexity drops from O(J·L²) to around O(L log L). I tried J=3, input 1024 points, output only 128. GPU memory was definitely saved—I happily ran it—and got about a 12% higher MSE for long-term predictions (100 steps ahead)! 12%! Why?

This gets interesting. Think about it: every pooling operation “condenses” information by letting the positions with maximum values dominate what the next layer sees. But what if key information is spread across patterns at different scales? For example, periodic components mixed with trends, where pooling might just drop the trend. I later ran a controlled experiment: changed only the distillation, kept everything else the same. On a dataset with clear trends and seasonality, the model with distillation performed noticeably worse on long-term forecasts. I then switched to a gentler distillation strategy, shrinking the sequence only to 1/3 of its original length overall instead of 1/8, and the performance came back.
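To make the squeeze concrete, here’s a rough sketch of what stride-2 max pooling does to both length and content. It assumes a kernel of 3, stride 2 and padding 1 for the pooling step (my assumption for illustration), and it leaves out the convolution and activation a real distilling layer would apply first. `max_pool_1d` and `distilled_lengths` are hypothetical helper names.

```python
def max_pool_1d(seq, kernel=3, stride=2, padding=1):
    """1-D max pooling with -inf padding on both ends."""
    pad = [float("-inf")] * padding
    padded = pad + list(seq) + pad
    n_out = (len(padded) - kernel) // stride + 1
    return [max(padded[i * stride : i * stride + kernel]) for i in range(n_out)]

def distilled_lengths(seq_len, num_pools):
    """Sequence length after each successive pooling step."""
    lengths = [seq_len]
    seq = [0.0] * seq_len
    for _ in range(num_pools):
        seq = max_pool_1d(seq)
        lengths.append(len(seq))
    return lengths

print(distilled_lengths(1024, 3))          # [1024, 512, 256, 128]
print(max_pool_1d([1, 9, 2, 3, 8, 4]))     # [9, 9, 8]
```

Three pools take 1024 points down to 128, and in the toy series the small values between the peaks never reach the next layer. That is fine when the peaks carry the signal, and risky when a slow trend lives in the values that get dropped.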

So distillation is a double-edged sword: it saves memory, but beware of information loss. If your input sequence is already long enough, consider using fewer layers or skipping full distillation. Or follow the paper’s own remedy: run replica stacks on halved inputs with fewer distilling layers and concatenate their outputs. Don’t squeeze too hard out of excitement, or the model goes blind.

Pitfall Three: Generative Decoder—One‑Shot Output, but There’s a Price

I genuinely love this design. The original Transformer decoder spits out time steps one by one, slow as a snail. Informer’s decoder takes a “start token” plus a sequence of zeros as placeholders, does a single forward pass, and outputs the full prediction directly. Inference is an order of magnitude faster! In my project, predicting 10080 points, the original Transformer step‑by‑step took nearly two minutes; Informer’s generative decoder did it in under ten seconds. Look at that comparison—pretty satisfying, right? It felt great!

But at what cost? The design of the decoder input is critical. You need to concatenate the last known segment of the history (the “start token” in the paper) with a placeholder (e.g., all zeros or the average), feed the whole thing into the decoder, and use masked self‑attention so each position sees only the history and the placeholder positions up to itself, with everything to its right hidden. If the start token is poorly chosen, or the placeholder deviates too much from the real trend, the model might just go off the rails, flying off on its own.
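Here’s a minimal sketch of that decoder input for a single univariate series, with both placeholder options. The names `build_decoder_input`, `label_len`, `pred_len` and `mean_window` are assumptions for illustration, and a real model would feed multivariate tensors with time features rather than plain lists.

```python
def build_decoder_input(history, label_len, pred_len,
                        placeholder="zeros", mean_window=24):
    """Start token (last label_len known points) followed by pred_len placeholders."""
    start_token = list(history[-label_len:])
    if placeholder == "zeros":
        fill = 0.0
    elif placeholder == "mean":
        tail = history[-mean_window:]
        fill = sum(tail) / len(tail)  # mean of the last few known points
    else:
        raise ValueError(f"unknown placeholder strategy: {placeholder}")
    return start_token + [fill] * pred_len

def causal_mask(size):
    """mask[i][j] is True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]

history = [float(v) for v in range(1, 11)]  # toy history: 1.0 .. 10.0
print(build_decoder_input(history, label_len=4, pred_len=3))
print(build_decoder_input(history, label_len=4, pred_len=3,
                          placeholder="mean", mean_window=2))
print(causal_mask(3))
```

The whole sequence goes through the decoder in one forward pass, and the outputs at the placeholder positions are read off as the forecast.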

I tried two placeholder strategies: all zeros vs. extrapolating using the mean of the last few historical points. The latter worked significantly better. So don’t be lazy and just use zeros—give the decoder some prior info. Also, while the generative decoder avoids error accumulation, it makes a one‑shot prediction that relies heavily on global dependencies. If there’s a sudden event in the sequence (like a holiday causing a power consumption spike), the model hasn’t seen a similar start pattern, and a one‑shot approach tends to “average out” such anomalies. On a test set, I encountered a load trough around Chinese New Year; Informer couldn’t reproduce that sharp drop because it relied more on learned periodic patterns than on the anomaly. With step‑by‑step decoding, you might catch the downward trend in the first few steps and adjust. So if your task is sensitive to sudden anomalies, you might want to keep step‑by‑step decoding as a backup.

Some Will Say: Isn’t It Just Making Attention Sparse? Not Much Innovation, Right?

I’ve heard that a lot. But I want to say—academic innovation doesn’t always require a brand‑new mathematical framework. Being able to improve both efficiency and accuracy on real‑world large‑scale data is already very valuable. I reproduced it on four public datasets (ETT, Electricity, Weather, Exchange), and Informer’s MSE was indeed 10%–30% lower than the then‑SOTA (like LSTNet, DeepAR, Transformer), with inference speed crushing them. What the industry needs are tools that work and are easy to use,


Cael Lee
