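Interpretability: Explaining the Expressiveness Bottleneck of Convolutional Decoder Neural Networks from a Frequency-Domain Perspective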

Generated: 2026-06-23 06:15:36

---

Believe it or not, three years ago I ran an experiment, and even now, thinking about it sends a chill down my spine.

At the time, I was evaluating a ResNet-50 on CIFAR-10, applying frequency-domain filtering to the test images. After low-pass filtering, the images turned into blurry blocks of color; you couldn't tell what they were with the naked eye. Guess what? The model still gave over 30% accuracy. Can you believe it?

What's even crazier is high-pass filtering — the images became nothing but noisy edges and textures, looking like garbled code to human eyes, yet the model maintained about 50% accuracy.

My reaction at the time wasn't excitement; it was fear.

If I myself can't recognize something, what right does the model have to claim it knows it?

This isn't mysticism! I dug out Xu et al. 2019's paper on the frequency principle again, read it several times, double-checked the data, and confirmed it wasn't a bug. So I started pondering a more fundamental question: The "bottleneck of expressiveness" we talk about all the time — what exactly is it? Is it possible that it's actually hidden in frequency?

---

1. What the Model Sees Is Not What You Think

Let me start with the specific numbers from that experiment. Take a ResNet-50 trained on ImageNet. After low-pass filtering, where only a small central patch of the frequency spectrum is kept, the image becomes a mosaic. What's the accuracy? Over 30%. With such limited low-frequency information, the model can still make stable judgments, which suggests it relies heavily on low-frequency contours and structures.

On the high-pass side, things get more interesting. Remove all low frequencies, keep only high-frequency textures and edges. Human eyes can barely make out the object, but the model's accuracy is even higher, close to 50%. Moreover, as you gradually add more high-frequency information, accuracy improves step by step, unlike low frequencies where "a little addition causes a spike, then saturates quickly."
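The two filters described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the exact pipeline from my experiment: the disc-shaped mask, the `radius` value, and the random stand-in image are all assumptions, and a color image would be filtered one channel at a time.

```python
import numpy as np

def radial_mask(h, w, radius):
    """Boolean mask that is True within `radius` of the centre of a shifted spectrum."""
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2

def frequency_filter(image, radius, mode="low"):
    """Keep frequencies inside (low-pass) or outside (high-pass) a centred disc."""
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # move the DC term to the centre
    mask = radial_mask(h, w, radius)
    if mode == "high":
        mask = ~mask
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # hypothetical single-channel test image
low = frequency_filter(image, radius=4, mode="low")
high = frequency_filter(image, radius=4, mode="high")
```

Because the two masks are complementary, `low + high` reproduces the original image, which is a handy sanity check before feeding filtered images to a model.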

When I discussed these results with students in my team, someone asked: Doesn't this show that the model uses different frequencies for multi-level decision-making?

I said yes, but not entirely. More precisely, the model has different expressive capabilities at different frequencies, and these capabilities may seriously conflict with each other.

You see, high frequencies correspond to details and textures, low frequencies to global structures and context. The model needs to balance both, but what often happens is: on low frequencies, it appears very "confident" — once low frequencies are sufficient, accuracy shoots up; on high frequencies, it's like a cautious scholar, accumulating evidence bit by bit. This itself suggests that the model's strategy for utilizing information at different frequencies is fundamentally different.

Later I also tried adversarial training and Gaussian augmentation. As that paper mentioned, these augmentation techniques are only effective against corruption in certain frequency bands, and can be useless or even harmful for other bands. For example, adversarial training primarily enhances the model's resistance to high-frequency noise, but has almost no effect on low-frequency perturbations.

This reminds me of an old problem: Why is it that data augmentation often "fixes one thing at the expense of another"?

Looking back now, the answer may lie in frequencies. Different augmentation methods essentially alter the frequency distribution of the training data. Gaussian noise has a roughly flat spectrum, while natural images concentrate their energy at low frequencies, so in relative terms the noise mainly pollutes high frequencies; adversarial perturbations tend to favor high frequencies as well. The model learns to resist interference at those specific frequencies, but what about other frequencies? No one cares.
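To make the point about Gaussian noise concrete, here is a rough sketch that measures how much spectral energy sits outside a low-frequency disc. The smooth blob is only a stand-in for a natural image, and the `radius` and noise level are arbitrary assumptions.

```python
import numpy as np

def high_freq_fraction(image, radius):
    """Fraction of spectral energy lying outside a centred disc of the given radius."""
    h, w = image.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    yy, xx = np.ogrid[:h, :w]
    outside = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 > radius ** 2
    return power[outside].sum() / power.sum()

rng = np.random.default_rng(0)
# Stand-in for a natural image: a smooth blob dominated by low frequencies.
yy, xx = np.mgrid[:64, :64]
smooth = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 10.0 ** 2))
noise = rng.normal(0.0, 0.1, size=smooth.shape)

frac_clean = high_freq_fraction(smooth, radius=8)
frac_noise = high_freq_fraction(noise, radius=8)
frac_noisy = high_freq_fraction(smooth + noise, radius=8)
```

In this toy setup the noise spreads its energy across the whole spectrum, so adding it raises the high-frequency share of the image far more than it changes the low-frequency content.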

This is the first bottleneck of expressiveness: The model's learning in frequency space is uneven, and this unevenness directly determines its robustness and generalization.

---

2. The Fancy Facade of Dynamic Convolution

Speaking of which, let's talk about convolution itself. Last year I reviewed a paper on FDConv, and I couldn't help but laugh — not because it was poorly written, but because it exposed something I had suspected but never verified.

Traditional dynamic convolution (e.g., ODConv) boasts multiple parallel convolution kernels that are adaptively combined into sample-specific weights. Sounds cool, right? But look at the actual results: the authors plotted the frequency responses of ODConv's four weight groups and found them heavily overlapping. Visualized with t-SNE, the kernels all cluster together.

In other words, you give the model several experts, but these experts all think alike, giving similar advice for any problem. So what's the difference between several and one?

Parameter redundancy, limited adaptivity. I'm not making this up; the frequency response curves in the paper are clear as day.
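One rough way to check this kind of redundancy yourself is to compare the magnitude of each kernel's 2-D DFT. The sketch below is my own illustration, not the paper's analysis: the "redundant" kernels are hypothetical small perturbations of one random kernel, and cosine similarity is just one possible overlap measure.

```python
import numpy as np

def frequency_response(kernel, size=32):
    """Magnitude of the kernel's 2-D DFT after zero-padding to size x size."""
    return np.abs(np.fft.fft2(kernel, s=(size, size)))

def response_similarity(k1, k2, size=32):
    """Cosine similarity between two magnitude responses (1.0 means same shape)."""
    r1 = frequency_response(k1, size).ravel()
    r2 = frequency_response(k2, size).ravel()
    return float(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))

rng = np.random.default_rng(0)
base = rng.normal(size=(3, 3))
# Hypothetical "experts": small perturbations of one kernel, mimicking redundant groups.
redundant = [base + 0.05 * rng.normal(size=(3, 3)) for _ in range(4)]
# A deliberately different pair: averaging (low-pass) and Laplacian (high-pass) kernels.
box = np.full((3, 3), 1 / 9)
laplacian = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)

sim_redundant = min(response_similarity(redundant[0], k) for k in redundant[1:])
sim_distinct = response_similarity(box, laplacian)
```

A similarity close to 1 across all weight groups is the numerical counterpart of the overlapping curves: many kernels, one frequency response.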

FDConv's solution is to generate and modulate convolution kernels in the Fourier domain. By using so-called "Fourier-disjoint weights" and "frequency band modulation," different weights genuinely contribute independently at different frequencies. What's the benefit of this? You no longer need so many parallel kernels. Each kernel takes care of its own turf: low-frequency ones handle denoising and structure, high-frequency ones handle details and boundaries.
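The idea of weights that occupy disjoint parts of the spectrum can be illustrated with a toy example. To be clear, this is not FDConv's implementation: the function name, the equal-width radial bands, and the 8x8 kernel are all assumptions made for the sketch. It only shows that one set of Fourier coefficients can be split into groups whose spatial kernels have non-overlapping frequency responses.

```python
import numpy as np

def fourier_disjoint_kernels(base_kernel, num_groups):
    """Split one real kernel into `num_groups` kernels with non-overlapping spectra."""
    n = base_kernel.shape[0]
    spectrum = np.fft.rfft2(base_kernel)        # half spectrum, shape (n, n // 2 + 1)
    fy = np.abs(np.fft.fftfreq(n))[:, None]     # row frequencies with the sign removed
    fx = np.fft.rfftfreq(n)[None, :]            # non-negative column frequencies
    radius = np.sqrt(fy ** 2 + fx ** 2)
    # Equal-width radial bands; band g owns the Fourier indices whose radius falls in it.
    edges = np.linspace(0.0, radius.max(), num_groups + 1)
    band = np.digitize(radius, edges[1:-1])
    return [np.fft.irfft2(spectrum * (band == g), s=base_kernel.shape)
            for g in range(num_groups)]

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8))  # hypothetical square kernel
kernels = fourier_disjoint_kernels(base, num_groups=4)
```

The four kernels sum back to `base`, yet no two of them respond to the same frequency, which is the opposite of the clustered responses described earlier.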

After reading it, I thought: Isn't this an interpretability perspective from the frequency domain?

At this point, think about the essence of convolution. Each kernel is a linear filter with a fixed frequency response. Convolution kernels of different sizes and strides naturally respond to different frequencies. Standard CNNs expand the receptive field by stacking many layers, which is essentially processing information from high to low frequencies layer by layer. But the problem is that this layering process is implicit; the model itself doesn't know what frequency it's dealing with.
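The filter view of convolution follows from the convolution theorem: circular convolution in the spatial domain equals pointwise multiplication in the frequency domain. Here is a small check, assuming circular (wrap-around) padding and a Laplacian kernel chosen only as an example. Deep learning layers actually compute cross-correlation, which flips the kernel but leaves the magnitude response unchanged.

```python
import numpy as np

def circular_conv2d(image, kernel):
    """Direct circular convolution: kernel-weighted sum of shifted copies of the image."""
    out = np.zeros_like(image, dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * np.roll(image, shift=(dy, dx), axis=(0, 1))
    return out

def fft_conv2d(image, kernel):
    """Same operation done by multiplying the image spectrum by the kernel's response."""
    response = np.fft.fft2(kernel, s=image.shape)
    return np.fft.ifft2(np.fft.fft2(image) * response).real

rng = np.random.default_rng(0)
image = rng.random((16, 16))
laplacian = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)

spatial = circular_conv2d(image, laplacian)
spectral = fft_conv2d(image, laplacian)
dc_gain = np.fft.fft2(laplacian, s=image.shape)[0, 0]  # response at zero frequency
```

The two results agree, and the Laplacian's response at zero frequency is zero, so this particular kernel removes the image mean and passes edges: its frequency behavior is baked into its weights.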

So when a task requires flexible adjustments of frequency response — for example, different regions of an image need more details or more structure — standard convolution becomes rigid. Dynamic convolution tries to solve this, but traditional approaches only increase the number of parameters without addressing frequency diversity. The result is seemingly flexible but actually redundant.

The second bottleneck of expressiveness: The expressive capacity of convolution kernels in frequency space is severely limited. Multiple sets of weights do not equal multiple distinct frequency responses.

---

3. Where Exactly Is the Bottleneck

These two bottlenecks (uneven frequency learning and homogeneous frequency responses across convolution kernels) point to the same core issue: the way neural networks model information at different frequencies is messy, uncontrollable, and lacks a theoretical framework to describe it.

I've been working on interpretability for years. Starting from game-theoretic interaction theory, our team has attempted to quantify and explain the knowledge points of neural networks. For example, we proved that low-order game interactions mainly correspond to global, simple concepts (similar to low-frequency information), while high-order interactions correspond to fine-grained, combinatorial concepts (like high-frequency details). The model learns low-order (low-frequency) interactions more easily; high-order interactions require more data and more complex networks to learn well. This is consistent with the frequency-domain experiments: low-frequency information quickly boosts accuracy, while high-frequency information accumulates slowly.
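For readers who haven't met interaction orders before, here is a toy computation. It uses one definition from the game-theoretic interpretability literature, the multi-order interaction between two inputs i and j averaged over contexts S of a fixed size; whether this matches every detail of the formulation referred to above is an assumption, and `toy_model` is a made-up scoring function, not a neural network.

```python
from itertools import combinations
from statistics import mean

def multi_order_interaction(v, players, i, j, order):
    """Average of v(S+ij) - v(S+i) - v(S+j) + v(S) over contexts S of size `order`."""
    others = [p for p in players if p not in (i, j)]
    deltas = []
    for context in combinations(others, order):
        s = frozenset(context)
        deltas.append(v(s | {i, j}) - v(s | {i}) - v(s | {j}) + v(s))
    return mean(deltas)

players = range(4)

def toy_model(subset):
    """Hypothetical output: a pairwise term plus a term that needs all four inputs."""
    pair = 1.0 if {0, 1} <= subset else 0.0
    full = 2.0 if len(subset) == 4 else 0.0
    return pair + full

low_order = multi_order_interaction(toy_model, players, 0, 1, order=0)
high_order = multi_order_interaction(toy_model, players, 0, 1, order=2)
```

With no context (order 0) only the simple pairwise term shows up; with the full context (order 2) the interaction also picks up the term that depends on all four inputs together, which is the "combinatorial" kind of knowledge discussed above.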

More importantly, our theoretical derivation found that knowledge points of different complexity have fundamentally different learning difficulties. Low-frequency knowledge (global structures) is easy to learn and generalizes well, but is highly redundant; high-frequency knowledge (local details) is hard to learn and less robust, but is often crucial for distinguishing key categories. This contradiction constitutes the deepest bottleneck of expressiveness in deep learning: you cannot have a model that is optimal at all frequencies simultaneously.

Why does adversarial training improve robustness? Game-theoretic interaction analysis shows that it actually suppresses the over-response of certain high-order interactions (high-frequency knowledge), forcing the model to rely more on low-frequency structures. Does this make the model "dumber"? In a sense, yes, but this "dumbness" makes it more robust.

The frequency perspective provides a clean interpretation: adversarial perturbations mainly pollute high frequencies, and to resist, the model proactively reduces its dependence on high frequencies. What's the cost? Its performance on fine-grained tasks decreases, because fine-grained classification requires high-frequency information to distinguish subtle differences. The article mentions that "the same data augmentation improves performance against some corruptions but degrades it against others" — the reason lies here: augmentation methods change the frequency distribution of training data, the model adjusts its frequency preference, but this adjustment cannot simultaneously benefit all frequencies.

So, when I talk about the "expressiveness bottleneck of deep learning," my understanding is: the bottleneck is essentially a resource allocation dilemma of the neural network in frequency space. The model's computational power and parameters are limited, forcing trade-offs between different frequencies. Current structures (including standard convolutions and most dynamic convolutions) do not provide flexible ways to control these trade-offs. The result is often "low frequencies overwhelm high frequencies" or "high frequencies overfit while low frequencies underfit." Hardly any model

Cael Lee
