Deep Learning, Part 2: CNNs (Convolutional Neural Networks) in 30 Minutes, Plain Enough for Anyone

Generated: 2026-06-20 23:50:39

---

Ten years ago, the first time I read a CNN tutorial, I got so mad I threw the book across the room.

True story. That tutorial started with Fourier transforms, sparse connections, gradient derivations… I spent three days and was still stuck on the first page. Later, I built my own project, stepped into countless pitfalls, and finally understood one thing—

CNN isn't actually that mysterious. It's the people writing those tutorials who make it sound that way.

Today, I'm going to break down the core logic of CNN in a conversational way, just like we're chatting. You ready?

---

First, let me throw a question at you: why do we even need CNN?

You might think, deep learning is just stacking neurons, right? So if I build a fully connected network with a hundred layers, shouldn't it be able to learn anything theoretically?

Let me do the math for you.

A 28×28 grayscale image has 784 input nodes. A first fully connected layer with 128 neurons needs 784×128 weights plus 128 biases, which is 100,480 parameters, and that's just the first layer.
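If you want to check that arithmetic yourself, here is a minimal sketch in plain Python. The helper name dense_params is mine, made up for illustration; it just counts one weight per input-output pair plus one bias per neuron.

```python
def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_outputs + n_outputs

height, width = 28, 28
n_inputs = height * width                  # a flattened 28x28 grayscale image
first_layer = dense_params(n_inputs, 128)  # 128 neurons in the first layer

print(n_inputs, first_layer)  # 784 100480
```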

What's worse, a fully connected network flattens the image into a one-dimensional vector, so it no longer knows which pixels were neighbors.

Think about it. In an image, neighboring pixels are closely related, and distant pixels have weaker relationships—that's common sense, right? But a fully connected network ignores all that. It treats every pixel equally, which is like taking a complete Lego castle, breaking it into pieces, and telling the model, "Now rebuild it on your own."

The model cries. You cry too.

The first time I ran CIFAR-10 on a fully connected network, after 50 epochs the accuracy was 40%. I switched to the simplest LeNet-level CNN, and it easily hit 60%.

That was the moment I realized: CNN isn't here to compete with you—it's here to save you.

Its core ideas are just three: local connectivity, weight sharing, and translation invariance. In plain terms, CNN gives the neural network a pair of "spatial glasses" that say, "Hey, you're dealing with a 2D image—look at the local first, then the whole picture!"

---

What exactly is the convolutional layer "convolving"?

Back in the day, I watched animations over fifty times before I finally got it. Later I realized it doesn't have to be that complicated.

A convolution kernel is like a magnifying glass in your hand.

Take a 3×3 magnifying glass and slide it over the image from left to right, top to bottom. At each position, multiply the pixels inside the magnifying glass by the weights of the kernel, then sum them up—you get one value. Simple as that.

Where do the kernel weights come from? They're learned. Initially random, and during training they automatically learn to detect certain patterns. Some kernels specialize in vertical edges, some in horizontal edges, some in textures.
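To make the sliding magnifying glass concrete, here is a rough sketch in plain Python with no padding and stride 1. The function conv2d and the hand-set vertical-edge kernel are my own illustration; a real network would learn its weights during training instead of having them typed in. (What deep learning frameworks call "convolution" is this same multiply-and-sum without flipping the kernel.)

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # multiply the pixels under the window by the weights, then sum
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# 4x6 toy image: dark (0) on the left, bright (1) on the right
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# hand-set kernel that responds to dark-to-bright vertical edges
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = conv2d(image, vertical_edge)
print(feature_map)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The output is zero over the flat regions and lights up only where the window straddles the edge, which is exactly the "pattern detector" behavior described above.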

One counterintuitive fact: a convolution kernel is just a set of learnable parameters.

Speaking of kernel size: I've made the mistakes so you don't have to. At first I always agonized over whether to use large 5×5 or 7×7 kernels. Then I discovered that stacking two 3×3 convolutional layers usually works better than a single 5×5 layer, with fewer parameters. That's what the VGG series does. Two stacked 3×3 layers (stride 1) cover the same 5×5 receptive field, but per input-output channel pair they need only 3×3×2=18 weights instead of 25, which is 28% fewer. You also get an extra nonlinearity between the two layers. This trick saved my poor GPU memory back then.
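The 3×3 versus 5×5 comparison is easy to verify. The sketch below assumes stride 1 everywhere and counts weights for a single input-output channel pair, ignoring biases; both helper names are hypothetical.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 conv layers applied one after another."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the view by k - 1 pixels
    return rf

def weights_per_channel_pair(kernel_sizes):
    """Kernel weights for one input channel feeding one output channel."""
    return sum(k * k for k in kernel_sizes)

two_3x3 = [3, 3]
one_5x5 = [5]

print(stacked_receptive_field(two_3x3), weights_per_channel_pair(two_3x3))  # 5 18
print(stacked_receptive_field(one_5x5), weights_per_channel_pair(one_5x5))  # 5 25

saving = 1 - weights_per_channel_pair(two_3x3) / weights_per_channel_pair(one_5x5)
print(f"{saving:.0%} fewer weights")  # 28% fewer weights
```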

Let me also mention multi-channel input. An RGB image has three channels, so each kernel is really a stack of three 3×3 slices, one per input channel. Each slice is convolved with its own channel, and the three results are summed into a single output channel. The number of output channels is determined by the number of kernels. For example, if the first convolutional layer outputs 16 channels, then there are 16 sets of kernels, each set covering all input channels.
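Here is a small sketch of how one output value comes out of a three-channel input, at a single window position. The function conv_at and the toy numbers are assumptions for illustration only.

```python
def conv_at(patches, kernel_slices):
    """One output value: per-channel multiply-and-sum, then sum over channels.

    patches:       one k x k patch per input channel
    kernel_slices: one k x k weight slice per input channel
    """
    total = 0.0
    for patch, k_slice in zip(patches, kernel_slices):
        for patch_row, weight_row in zip(patch, k_slice):
            for pixel, weight in zip(patch_row, weight_row):
                total += pixel * weight
    return total

def filled(value, k=3):
    """A k x k grid filled with one value, just to keep the demo short."""
    return [[value] * k for _ in range(k)]

rgb_patches = [filled(1.0), filled(2.0), filled(3.0)]    # R, G, B patches
kernel_slices = [filled(1.0), filled(0.5), filled(0.0)]  # one slice per channel

value = conv_at(rgb_patches, kernel_slices)
print(value)  # 18.0

# 16 output channels means 16 such kernels, each with 3 slices of 3x3 weights
in_channels, out_channels, k = 3, 16, 3
n_weights = out_channels * in_channels * k * k
print(n_weights)  # 432 (plus 16 biases if the layer uses them)
```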

Warning from personal experience: the first time I wrote code, I set padding to 0, and the feature map shrank so fast that after a few layers it was smaller than the kernel itself, causing an error. Later I memorized this formula, and you should too:

Output size = floor((Input size + 2×padding - kernel_size) / stride) + 1

This formula has saved me three times.
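Here is the formula as a tiny Python helper, plus a loop showing how fast a feature map shrinks with padding 0. The 5×5 kernel and the 28-pixel input are just example values I picked, and conv_output_size is a name I made up.

```python
def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Output size along one dimension; integer division does the floor."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, kernel_size=3))             # 26
print(conv_output_size(28, kernel_size=3, padding=1))  # 28, size preserved

# Stack 5x5 convolutions with padding 0 until the map is smaller than the kernel
size, kernel_size = 28, 5
sizes = [size]
while size >= kernel_size:
    size = conv_output_size(size, kernel_size)
    sizes.append(size)

print(sizes)  # [28, 24, 20, 16, 12, 8, 4]
```

With a 3×3 kernel, padding 1 and stride 1 keep the size unchanged, which is why that combination shows up so often.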

---

Pooling layer: overkill or a stroke of genius?

I used to think it was unnecessary: "Since convolution already extracts features, why compress it further?"

Then one day, on a whim, I ran an experiment: I removed all pooling layers. The training accuracy was okay, but the validation accuracy bounced around like an EKG, and the computational cost skyrocketed. So I quickly added them back.

Pooling has two core purposes. First, reducing spatial dimensions: max pooling with a 2×2 window and stride 2 halves the height and width of the feature map, so the layers that follow process only a quarter as many values. Second, enhancing translation invariance: if the image shifts a few pixels to the left, the pooling layers barely care, and the output difference is small.
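Both effects show up in a toy example. This is a plain-Python sketch of max pooling (the function name and the 4×4 maps are mine); the shift here is deliberately one pixel so that each peak stays inside its 2×2 window, which is the case where the pooled output is exactly unchanged.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Max pooling over a 2D list, no padding."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(fmap[i * stride + di][j * stride + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

original = [[9, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 5, 0],
            [0, 0, 0, 0]]

# same activations, shifted one pixel to the right
shifted = [[0, 9, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 5],
           [0, 0, 0, 0]]

print(max_pool2d(original))  # [[9, 0], [0, 5]]
print(max_pool2d(shifted))   # [[9, 0], [0, 5]]
```

Sixteen values go in and four come out, and the small shift disappears. A shift that crosses a window boundary would change the output, so the invariance is only approximate.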

I once did an ablation study: with the same network structure, removing the second pooling layer slowed training by nearly 30% and made overfitting worse. Pooling has no parameters of its own, but without it the feature maps stayed large, so the flattened input to the fully connected layer (and that layer's parameter count) didn't shrink, and the model memorized too much positional noise. Since then, I've been a die‑hard fan of pooling.

As for choosing between max pooling and average pooling: in most cases, use max pooling, because it keeps the strongest activation in each window, which is usually what matters for detecting a feature. I only use average pooling at the very end of the network, as global average pooling in place of large fully connected layers to cut parameters. GoogLeNet and ResNet both do this before their final classifier layer.

Pitfall: I used to confuse kernel size and stride in pooling. I once used a 3×3 kernel with stride 1, and the feature map size barely changed, wasting computation. So I got into the habit: kernel size equals stride. A 2×2 kernel with stride 2 cuts the size in half.

---

The four parameters most likely to trip you up when building a CNN

The mistakes I've made here could fill a swimming pool. Don't make any of them.

The most common error is messing up the input size or number of channels. When I first used MNIST, with 28×28 grayscale single‑channel images, I wrote my first convolutional layer as nn.Conv2d(3,16,3) and got a channel mismatch error: the first argument is the number of input channels, so it should have been nn.Conv2d(1,16,3). After that, I made it a habit: no matter what, print out input_shape to double‑check.

Another particularly tricky spot is the input dimension of the first fully connected layer. After convolution and pooling, the feature maps are flattened into a 1D vector for the fully connected layers, and that layer's input size is fixed when you define the network. If the input image size changes, the flattened dimension changes too, and the layer no longer fits. For example, after two conv‑pool blocks, the feature map becomes (32,7,7), so flattening gives 32×7×7=1568. Many people just write self.fc = nn.Linear(1568,10) without checking the actual dimensions. My habit is to run a forward pass first, print the flattened size, and then hard‑code it. Alternatively, use nn.AdaptiveAvgPool2d((1,1)) to compress each feature map to 1×1, so the flattened size is just the channel count and no manual calculation is needed. But for beginners, I recommend calculating it manually first to understand the whole process: laziness should come after understanding.
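If you want to do the manual calculation once, a shape tracer like this is enough. It is only a sketch under my own assumptions: each block is a 3×3 convolution with padding 1 followed by 2×2 max pooling with stride 2, and the channel counts are 16 then 32. The source only tells us the result is (32,7,7), so the rest is illustrative.

```python
def out_size(size, kernel_size, padding=0, stride=1):
    """Output size along one spatial dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

def flattened_dim(input_size, channels_per_block):
    """Trace a square input through conv(3x3, pad 1) + pool(2x2, stride 2) blocks."""
    size = input_size
    for _ in channels_per_block:
        size = out_size(size, kernel_size=3, padding=1)  # conv keeps the size
        size = out_size(size, kernel_size=2, stride=2)   # pooling halves it
    return channels_per_block[-1] * size * size

print(flattened_dim(28, [16, 32]))  # 1568 -> matches nn.Linear(1568, 10)
print(flattened_dim(32, [16, 32]))  # 2048 -> a 32x32 input no longer fits
```

Same network, different input size, different flattened dimension. That is the whole trap in two lines of output.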

The activation function is another easy mistake. ReLU is standard, no problem. But I made a rookie error: I added ReLU after every convolutional layer, and also after the final classifier layer—that prevented any negative outputs, truncating some class logits, and accuracy couldn't break through. I fixed it by removing the activation from the final fully connected layer, and everything worked fine.
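To see why a ReLU on the last layer hurts, take an example where every class logit happens to be negative. This is a plain-Python illustration with made-up logits, not the original model; softmax and relu are written out by hand so the block stands alone.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relu(values):
    return [max(0.0, v) for v in values]

logits = [-2.0, -0.5, -3.0]  # hypothetical raw outputs for 3 classes

probs_raw = softmax(logits)         # class 1 clearly wins
probs_relu = softmax(relu(logits))  # ReLU turns every logit into 0.0

print(probs_raw.index(max(probs_raw)))  # 1
print(probs_relu)                       # three equal probabilities
```

After the ReLU the three classes are indistinguishable, so the ranking the network had learned is gone. In PyTorch, nn.CrossEntropyLoss expects raw logits, which is another reason to leave the final layer without an activation.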

The number of kernels and network depth are also pitfalls. Many beginners stack a ton of layers right away, doubling the number of kernels from 16 to 32 and on up to 512, causing the parameter count to explode and making training prone to vanishing gradients. My advice: start simple. Use 3 to 4 convolutional blocks, each followed by a pooling layer, gradually increasing channel count, like 16→32→64. This baseline can reach about 70% on CIFAR-10. Then gradually add more layers.
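A quick count shows how fast the convolutional weights grow as channels keep doubling. This sketch assumes 3×3 kernels with biases and an RGB input, and it counts only the convolutional layers (no fully connected head); the helper names are hypothetical.

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases for one conv layer with k x k kernels."""
    return out_ch * (in_ch * k * k) + out_ch

def stack_params(channels, k=3):
    """Total conv parameters for a channel schedule like [3, 16, 32, 64]."""
    return sum(conv_params(a, b, k) for a, b in zip(channels, channels[1:]))

small = stack_params([3, 16, 32, 64])
big = stack_params([3, 16, 32, 64, 128, 256, 512])

print(small)  # 23584
print(big)    # 1572768
```

The last layer alone (256→512) accounts for 1,180,160 of those parameters, which is why starting with 16→32→64 and growing from there is the gentler path.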

---

Five hidden traps in training a CNN

This part is rarely covered in online tutorials, but you'll encounter it every day in practice.

If the learning rate isn't set right, the model just spins its wheels. In CNN training, the learning rate has an even bigger impact than in fully connected networks. The


Cael Lee
