Deep Learning Algorithm Engineer Interview Notes: Microsoft, Alibaba, SenseTime, Megvii, Didi, Huawei

By CaelLee | 5 min read

Generated: 2026-06-20 21:41:33

---

Let me tell you, after writing tech columns for so many years, "interview tips" are the last thing I want to touch. They look like treasure maps, but step in and it's all traps! You grind through a ton of questions, feel rock solid, and then the interviewer casually takes a detour and you're completely lost.

Today I'm spilling the dark history of my fall recruitment season: Microsoft, Alibaba, SenseTime, Megvii, Didi, Huawei, Hikvision, Ping An, Momo, and 4Paradigm. Didn't land many offers, but I sure stepped into a lot of pits. Let's start with deep learning basics, the so-called "Part 1." Here's something counterintuitive: in deep learning interviews, it's not about how many models you've memorized, but how many levels deep your understanding goes. Most people die on Level 1: "I remember it!" But the interviewer is already waiting for you on Level 3.

---

1. ResNet vs DenseNet: Don't Just Memorize the "Why," You Won't Survive the Follow-ups

You ask me about the ResNet structure? I can recite it backwards! Residual blocks, shortcuts, BasicBlock, Bottleneck—easy peasy! Result? In my first SenseTime interview, the interviewer casually said: "What problem does ResNet solve?"

Me: "Vanishing gradients!"

He gave a little laugh: "Batch Normalization already solved that problem—are you sure?"

I was internally exploding! He was right: BN already keeps gradients in check, so what was ResNet really after? Turns out ResNet tackles network degradation! Past a certain depth, a deeper plain network ends up with higher training error than a shallower one, not because of overfitting, but because the extra layers are hard to optimize. With a shortcut, each block only has to learn a residual F(x) on top of its input x, so a layer that has nothing useful to add can push F(x) toward zero and fall back to an identity mapping instead of dragging the network down. If you don't understand this motivation, the interviewer will poke a hole in your argument with ease.
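To make the identity-mapping argument concrete, here's a tiny sketch in plain Python (no framework). The helper names are my own, and the "layer" is just a bias-free matrix multiply, so treat it as an illustration of the idea rather than a real ResNet block.

```python
def linear(weights, x):
    """Tiny bias-free fully connected layer: returns W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def plain_block(weights, x):
    """Plain layer: the output is only what the weights produce."""
    return linear(weights, x)


def residual_block(weights, x):
    """Residual layer: the weights model F(x); the shortcut adds x back."""
    return [xi + fi for xi, fi in zip(x, linear(weights, x))]


x = [1.0, -2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]

# With all-zero weights the plain layer wipes out the signal,
# while the residual layer passes the input through unchanged.
plain_out = plain_block(zero_weights, x)
residual_out = residual_block(zero_weights, x)
```

The point to say out loud in the interview: for the residual layer, "do nothing" is the zero-weight solution, while a plain layer has to learn an identity matrix to achieve the same thing.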

Next question: DenseNet vs ResNet, same number of layers, which is better?

At that time, I answered without thinking: DenseNet is better, fewer parameters. The interviewer instantly grinned: "Fewer parameters, but have you factored in the memory usage?"

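Every layer in DenseNet concatenates the outputs of all previous layers, so every earlier feature map has to stay in memory, and a naive implementation also allocates a fresh concatenated tensor for each layer. Memory consumption explodes as the block gets deeper! If you've never trained it yourself, you won't know this trap!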
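Here's a rough back-of-the-envelope sketch of that blow-up. It assumes a naive implementation that stores one concatenated input tensor per layer; the function names and the example sizes (6 layers, growth rate 32, 64 input channels, 56×56 maps) are made up for illustration, and real frameworks may share memory more cleverly.

```python
def dense_block_channels(num_layers, growth_rate, in_channels):
    """Input channel count seen by each layer of one dense block.

    Layer i receives the block input plus the outputs of the i layers before it.
    """
    return [in_channels + i * growth_rate for i in range(num_layers)]


def naive_concat_elements(num_layers, growth_rate, in_channels, height, width):
    """Elements held if every layer keeps its own concatenated input tensor."""
    per_layer = dense_block_channels(num_layers, growth_rate, in_channels)
    return sum(c * height * width for c in per_layer)


def unique_feature_elements(num_layers, growth_rate, in_channels, height, width):
    """Elements in the distinct feature maps (block input plus each layer's output)."""
    return (in_channels + num_layers * growth_rate) * height * width


channels = dense_block_channels(6, 32, 64)           # grows linearly per layer
naive = naive_concat_elements(6, 32, 64, 56, 56)     # sum grows quadratically with depth
unique = unique_feature_elements(6, 32, 64, 56, 56)  # what actually needs to exist
```

The per-layer channel count grows linearly, so the sum over layers grows quadratically with depth. That gap between `naive` and `unique` is what the interviewer was smiling about.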

My brutally honest advice: Don't cram structures at the last minute. Take a piece of paper, start from "what difficulties do deep networks face," derive the design of ResNet step by step, work out the parameter counts for both block types, and then compare DenseNet's parameter count and memory usage. Explain it to yourself out loud, record it, and play it back! Only when you've derived it from scratch will you understand why the interviewer just smiles.
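If you want to check your paper derivation for the two block types, a few lines of plain Python are enough. This is a sketch under simplifying assumptions: convolution weights only (no bias, no BN parameters), constant width, and a bottleneck that reduces channels by 4. The function names are mine.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution, ignoring bias and BN parameters."""
    return k * k * c_in * c_out


def basic_block_params(channels):
    """BasicBlock: two 3x3 convolutions at constant width."""
    return 2 * conv_params(3, channels, channels)


def bottleneck_params(channels, reduction=4):
    """Bottleneck: 1x1 reduce, 3x3 at the reduced width, 1x1 expand."""
    mid = channels // reduction
    return (conv_params(1, channels, mid)
            + conv_params(3, mid, mid)
            + conv_params(1, mid, channels))


basic_64 = basic_block_params(64)        # 73,728
basic_256 = basic_block_params(256)      # 1,179,648
bottleneck_256 = bottleneck_params(256)  # 69,632
```

A 256-channel Bottleneck costs about the same as a 64-channel BasicBlock and roughly 17 times less than a 256-channel BasicBlock, which is why the deeper ResNets can afford wide feature maps.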

---

2. Convolution Variants: Hand-Calculating FLOPs is a Death Sentence—Don't Be Lazy

"How much computation do depthwise separable convolutions save?" This isn't a concept question, it's an on-the-spot mental math problem! I fell right into this trap at Alibaba: the interviewer asked me to write down the FLOPs for standard convolution and depthwise separable convolution, and I mixed up the depthwise and pointwise terms. He said coldly: "You sure you don't want to double-check that?" I wished I could crawl under the table!

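Get it right, and remember it. With kernel size K, Ci input channels, Co output channels, and an H × W output feature map, the multiply-accumulate counts are: standard convolution K² × Ci × Co × H × W; depthwise K² × Ci × H × W; pointwise Ci × Co × H × W. Divide (depthwise + pointwise) by the standard cost and the ratio is 1/Co + 1/K². K=3? When the number of output channels is large, answer instantly: close to one ninth! Then the interviewer nods.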
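A quick way to make sure the formulas stick is to code them up and compare against the closed-form ratio. The sketch below counts multiply-accumulates only (no bias, stride 1, same-size output assumed), and the layer sizes in the example are arbitrary numbers I picked.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """Multiply-accumulates of a standard k x k convolution on an h x w output."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k, c_in, c_out, h, w):
    """Depthwise k x k (one filter per input channel) followed by pointwise 1x1."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


def cost_ratio(k, c_in, c_out, h, w):
    """Separable cost divided by standard cost; equals 1/c_out + 1/k**2."""
    return (depthwise_separable_macs(k, c_in, c_out, h, w)
            / standard_conv_macs(k, c_in, c_out, h, w))


# Example layer (hypothetical sizes): 3x3 kernel, 256 -> 256 channels, 56x56 output.
ratio = cost_ratio(3, 256, 256, 56, 56)
closed_form = 1 / 256 + 1 / 3 ** 2
```

Notice that H, W, and Ci cancel out of the ratio, so only Co and K matter. With Co = 256 the 1/Co term is tiny and the result sits just above 1/9.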

Even tougher is deformable convolution: How are the offsets learned? An extra convolution branch on the same input feature map predicts an offset for each sampling point. The interviewer pressed on: "Is there any constraint on the offset?"

I froze on the spot: there's no explicit constraint in the source code! But what he wanted to hear was: without constraints, the offset can wander outside the feature map, so a bounded range (like -1 to 1) is sometimes imposed on it. If you haven't read the source code, you won't have this answer!
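To see what "wandering outside" means in practice, here's a plain-Python sketch of the sampling step only. It assumes bilinear interpolation where neighbors outside the grid contribute zero, which is one common way to handle it. The `clamp_offset` helper is hypothetical and not taken from any official code; the real layer also predicts the offsets with a convolution branch, which this sketch leaves out.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2-D grid at fractional (y, x); neighbors outside the grid count as zero."""
    height, width = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                # Bilinear weight shrinks linearly with distance to the neighbor.
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def clamp_offset(offset, limit=1.0):
    """Optional bound on a learned offset (hypothetical helper for illustration)."""
    return max(-limit, min(limit, offset))


feature = [[1.0, 2.0],
           [3.0, 4.0]]
inside = bilinear_sample(feature, 0.5, 0.5)   # equal mix of all four cells
outside = bilinear_sample(feature, 5.0, 5.0)  # entirely off the grid
bounded = clamp_offset(3.7)                   # pulled back to the limit
```

An unbounded offset that lands far off the grid samples nothing but zeros, so that sampling point stops contributing. That's the failure mode the follow-up question is fishing for.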

My brutally honest advice: Derive the FLOPs for these two convolution types yourself and write them down as notes. Run the official code for deformable convolution, see how the offsets are generated, and find out why a modulated version came later. When the interviewer asks about details, being able to blurt out "the code handles it like this" is what really counts.

---

3. From Inception to GAN: Interviewers Like to Hear You Connect the Dots

In my first Microsoft interview, they didn't ask me about a single model—they started with: "Walk me through the Inception series—what was improved in each generation?" I gave a messy answer. Afterward, I drew an evolution diagram:

v1 multi-scale parallelism + 1×1 dimension reduction; v2 two 3×3 convolutions replacing 5×5; v3 convolution factorization + asymmetric convolutions + auxiliary classifiers; v4 introduced deeper Inception structures, while the Inception-ResNet variant added residual connections to go deeper. Interviewers love this kind of "big-picture evolution"—you're not memorizing one model, you're showing that you understand the design trends!

GANs are the same. They don't need you to write out the loss by heart, but you need to explain how the generator and discriminator play the game and why training is unstable. Then the interviewer follows up: "How does WGAN improve the original GAN?"

You need to answer: It replaces JS divergence with Wasserstein distance, the discriminator (now called a critic) outputs an unbounded score instead of a probability, and the critic has to satisfy a Lipschitz constraint. Then they ask: "How is Lipschitz enforced?" Many people get stuck! The earliest version used weight clipping; later, gradient penalty worked better. If you haven't read the original papers, you simply can't answer these details.
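Both tricks fit in a toy sketch. To stay framework-free I assume a linear critic f(x) = w · x, where the gradient with respect to the input is just w, so no autograd is needed. A real WGAN-GP implementation evaluates the gradient at points interpolated between real and fake samples; for a linear critic that gradient is the same everywhere. The function names are mine; c = 0.01 and λ = 10 are the defaults I remember from the papers.

```python
import math


def critic_score(weights, x):
    """Linear critic: an unbounded score, with no sigmoid on top."""
    return sum(w * xi for w, xi in zip(weights, x))


def critic_loss(weights, real_batch, fake_batch):
    """Wasserstein critic loss: mean score on fakes minus mean score on reals."""
    real = sum(critic_score(weights, x) for x in real_batch) / len(real_batch)
    fake = sum(critic_score(weights, x) for x in fake_batch) / len(fake_batch)
    return fake - real


def clip_weights(weights, c=0.01):
    """Weight clipping: force every weight into [-c, c] after each update."""
    return [max(-c, min(c, w)) for w in weights]


def gradient_penalty(weights, lam=10.0):
    """Gradient penalty for the linear critic: lam * (||grad_x f|| - 1)^2, with grad_x f = w."""
    norm = math.sqrt(sum(w * w for w in weights))
    return lam * (norm - 1.0) ** 2


w = [3.0, -4.0]
loss = critic_loss(w, real_batch=[[1.0, 0.0]], fake_batch=[[0.0, 1.0]])
clipped = clip_weights(w)
penalty = gradient_penalty(w)
```

Clipping is a blunt bound on every weight, while the penalty only pushes the gradient norm toward 1 and leaves individual weights free. That contrast is a good starting point when the interviewer asks why the later version trains better.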

---

4. You Say You've Memorized All the Interview Questions—Then Why Do You Still Fail? Because You're Reciting Answers!

Someone once yelled at me: "I've memorized this whole interview guide, but I still bombed the interview!"

Well, obviously! There's a world of difference between memorizing and understanding—ten hand-derived formulas apart! The interviewer twists it a bit: Right after asking about ResNet, they ask: "What if I change the shortcut to a 1×1 convolution?" Or, "Why does DenseNet's dense connection blow up memory?" These aren't in your interview guide, but if you've implemented them yourself, trained them on CIFAR-10, and compared memory usage, the answer is right in your hands.

When I prepared for interviews, I took every interview question and explained it to myself in plain language. At the same time, I wrote a minimal PyTorch demo to verify it. Deformable convolution? I ran a small recognition experiment. Depthwise separable convolution? After calculating FLOPs, I ran a speed test to see if theory matched practice. This way, no matter how the interviewer dug deeper, I could always come back to the code level for an answer.

---

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
