An Ultra-Detailed Walkthrough of Visual Generation (Table of Contents) (English)

Generated: 2026-06-20 12:35:37

---


After Writing Dozens of Issues on Visual Generation, These 5 Counter-Intuitive Findings Made Me Sit Up at 3 AM

You know what? Lately, I’ve become obsessed with reading papers.

Not the grim, grinding kind of reading—it’s that thing where you’re scrolling on your phone at 2 AM, and the more you read, the more awake you get.

Especially in visual generation. If you look away for a month, a dozen new names pop up. You barely figure out how to train VQGAN, and the next day a paper tells you: VQ? No need.

Today’s post isn’t a boring recap of papers I’ve read.

I’m going to lay out all the pitfalls I stepped in, the crashes I had, and the jaw-dropping moments I experienced while writing this series.

By the time you finish, you’ll at least understand one thing: Which direction is this technical line really heading? What’s genuinely useful, and what’s just polished packaging?

Come on, let’s walk through it together.

---

1. Autoregressive Visual Generation: Continuous Tokens Kicked VQ Off the Table

When you think about autoregressive generation, VQ is probably the first thing that comes to mind. You map images into a discrete codebook and do next‑token prediction like a language model.

But discrete codebooks have a fatal flaw: high fidelity and high compression ratio are sworn enemies.

You want good image quality? You need more tokens. More tokens mean your memory and computation explode. Clunky, hard to train, prone to crashing.
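To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The 512-pixel resolution and the downsampling factors are illustrative assumptions, not figures from any paper, and the cost model only counts pairs in full self-attention.

```python
def token_count(image_size: int, downsample: int) -> int:
    """Number of tokens for a square image at a given spatial downsampling factor."""
    side = image_size // downsample
    return side * side


def attention_pairs(num_tokens: int) -> int:
    """Full self-attention compares every token with every other token."""
    return num_tokens * num_tokens


# Hypothetical settings: a 512x512 image at three downsampling factors.
for factor in (32, 16, 8):
    n = token_count(512, factor)
    print(f"downsample {factor}x: {n} tokens, {attention_pairs(n):,} attention pairs")
```

Halving the downsampling factor quadruples the token count and multiplies the attention pairs by sixteen, which is where the memory goes.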

Until MAR showed up.

The first time I saw MAR, I don’t know how to describe it—it was like you just learned to ride a bike, and suddenly someone flies over your head on a rocket.

Kaiming’s team just flipped the table: no VQ, use continuous features for autoregression, and at each step use a diffusion model to predict the next token.

Think about it, how bold is that?

But when I tried it myself, the first pitfall hit me: how many diffusion steps to use?

The paper defaults to 100 sampling steps. That’s for people with A100s.

Me? A 24GB card. I dropped to 20 diffusion steps with the DDIM sampler to speed things up, and I barely managed to run a simple task. With too few steps (say 5), the output was a blurry mess with all texture detail lost. Raising it to 50 steps improved quality, but then every autoregressive step needed 50 reverse diffusion passes, and memory blew up.
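A minimal sketch of what changing the step count does, assuming a 1000-step training schedule, evenly spaced DDIM-style timesteps, and 256 tokens per image. All three numbers are illustrative assumptions, and `ddim_timesteps` and `denoiser_calls` are hypothetical helper names, not functions from the MAR code.

```python
def ddim_timesteps(train_steps: int, sample_steps: int) -> list[int]:
    """Evenly spaced subset of the training timesteps, noisiest first."""
    stride = train_steps // sample_steps
    steps = list(range(0, train_steps, stride))[:sample_steps]
    return steps[::-1]


def denoiser_calls(num_tokens: int, sample_steps: int) -> int:
    """Assumes every generated token runs its own reverse-diffusion chain."""
    return num_tokens * sample_steps


# Hypothetical budget comparison for one 256-token image.
for steps in (5, 20, 50, 100):
    schedule = ddim_timesteps(1000, steps)
    print(steps, "steps, first timesteps", schedule[:3],
          "denoiser calls:", denoiser_calls(256, steps))
```

The call count grows linearly with the number of sampling steps, so it is worth sweeping a few values on your own hardware before adopting the paper default.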

Lesson number one from this: don’t trust the parameters in papers. Those are for the rich folks.

Then came Fluid, the direct offspring of MAR for text‑to‑image.

Honestly, this one gave me more headaches than MAR.

Why? Because aligning text embeddings with visual features is much harder to control in continuous space. Think about it: text and images aren’t the same language. You have to force them to hold hands in a continuous space.

When I reproduced it, I found that if the tokenizer’s compression ratio is too high (say 8×), fine‑grained instructions in the text get lost. You ask it to draw “a white cat wearing a red hat,” and it gives you a cat with no hat.

Compression ratio too low? The sequence gets too long, autoregression can’t handle it.

Fluid’s solution was to add a side‑branch text encoder that injects text features directly into the diffusion loss at each step. This design was later borrowed by many.

Speaking of which, I really want to mention TokenBridge—a piece of work I think is severely underrated.

Its idea is especially clever: bridging discrete and continuous tokens via post‑training quantization.

How does it work? First you train a continuous tokenizer, then convert to discrete codes using a lightweight quantization module. That way you enjoy the high fidelity of a continuous tokenizer and still let models accustomed to discrete tokens use it.

But there’s a hidden trap in practice: the codebook learning of the quantization module easily collapses.

I tried two initialization methods and EMA updates. Only EMA with cosine annealing kept it stable. Otherwise, after training for a while, the codebook utilization dropped below 20%. Twenty percent of the codebook in use, eighty percent wasted. How wasteful is that?
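For anyone who wants to watch for this collapse, here is a rough sketch of the two pieces involved: a utilization metric and an EMA update whose decay follows a cosine schedule. Reading "EMA with cosine annealing" as an annealed decay is my assumption, the start and end values are made up, and none of the function names come from the TokenBridge code.

```python
import math


def codebook_utilization(code_indices: list[int], codebook_size: int) -> float:
    """Fraction of codebook entries that were selected at least once."""
    return len(set(code_indices)) / codebook_size


def cosine_annealed_decay(step: int, total_steps: int,
                          start: float = 0.9, end: float = 0.999) -> float:
    """Move the EMA decay from `start` to `end` along a half cosine (assumed schedule)."""
    progress = min(step, total_steps) / total_steps
    return end - (end - start) * 0.5 * (1.0 + math.cos(math.pi * progress))


def ema_update(old: list[float], new: list[float], decay: float) -> list[float]:
    """Blend one codebook vector toward the mean of the features assigned to it."""
    return [decay * o + (1.0 - decay) * n for o, n in zip(old, new)]


# Hypothetical batch: only 3 of 8 codes are ever selected.
print(codebook_utilization([0, 0, 1, 3], codebook_size=8))
print(cosine_annealed_decay(step=0, total_steps=100),
      cosine_annealed_decay(step=100, total_steps=100))
```

Logging the utilization every few hundred steps makes a collapse visible long before it shows up in reconstruction quality.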

SimpleAR is my favorite. 0.5B parameters, autoregressive text‑to‑image, GenEval score 0.59.

What does 0.59 mean?

DiT models with the same score have several times more parameters.

Its secret is just three stages: pretraining + SFT + RL.

I ran the three‑stage pipeline myself and didn’t hit many pitfalls. Just one: the data quality for SFT.

Seriously, I trained for one epoch using an open‑source aesthetic dataset, and prompt alignment actually regressed. I got annoyed, switched to per‑image filtering, keeping only the top 30% of samples, and finally saw improvement.

This taught me a truth: data quality > data quantity. Feeding 100 good images is far better than feeding 1000 junk ones.
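The filtering step can be sketched in a few lines. This assumes each sample already has a quality score from some scorer of your choice (the scorer itself is not shown, and the sample names and scores below are hypothetical); `keep_top_percent` is an illustrative helper.

```python
def keep_top_percent(samples: list, scores: list, percent: int = 30) -> list:
    """Keep the highest-scoring `percent` of samples, always keeping at least one."""
    keep = max(1, len(samples) * percent // 100)
    ranked = sorted(zip(scores, samples), key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in ranked[:keep]]


# Hypothetical data: ten image ids with made-up quality scores.
images = [f"img_{i}" for i in range(10)]
scores = [3, 9, 1, 7, 5, 8, 2, 6, 4, 0]
print(keep_top_percent(images, scores))
```

Using an integer percentage avoids floating-point surprises when computing how many samples to keep.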

For the RL stage, I used GRPO, which is related to Flow‑GRPO we’ll talk about later.

A practical lesson: the group size for GRPO can’t be too small. I tried 4 samples per group, and the reward variance was huge, with no convergence. The paper uses 64. I dropped to 32 and it worked, but at half the speed.
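Why group size matters can be seen from how GRPO scores samples: each reward is normalized against its own group. The sketch below is a simplified illustration, assuming 0/1 rewards and using the population standard deviation for brevity; real implementations may differ in these details.

```python
import math
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def baseline_std(success_rate: float, group_size: int) -> float:
    """Standard error of the group-mean baseline for 0/1 rewards (assumed reward model)."""
    return math.sqrt(success_rate * (1.0 - success_rate) / group_size)


print(group_advantages([1.0, 0.0, 0.0, 1.0]))
for size in (4, 32, 64):
    print(size, round(baseline_std(0.5, size), 4))
```

With a 50% success rate, the baseline's standard error is 0.25 for groups of 4 and 0.0625 for groups of 64, so small groups hand the policy a much noisier learning signal.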

And then there’s Token‑Shuffle. It achieves high‑resolution generation by shuffling the token order instead of using a fixed raster order. It reminded me of random mask prediction for video generation—it works better than fixed order, but inference has more uncertainty, requiring multiple sampling and selecting the best.

Throughout this line, the most important thing I learned is this: the threshold for continuous‑token autoregression is lower than you think.

Really, don’t be intimidated by those fancy papers.

It demands more from training stability and samplers, but it’s far from unreachable.

If you’re just starting, take my advice: don’t jump straight to a diffusion loss. First, get a simple MSE loss for direct next‑token prediction working, then gradually switch to diffusion loss.

Otherwise, you won’t even know if the model broke or if the loss was implemented wrong.
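As a toy illustration of that "MSE first" sanity check, the sketch below fits a one-parameter next-token predictor with plain gradient descent. It is deliberately tiny and makes no claim to resemble a real visual model: the sequence, learning rate, and epoch count are all made up. The point is only that an MSE loss gives a curve you can reason about before a diffusion loss enters the picture.

```python
def mse(pred: list[float], target: list[float]) -> float:
    """Mean squared error between predicted and true next tokens."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_linear_next_token(sequence: list[float], lr: float = 0.1, epochs: int = 200):
    """Fit x[t+1] ~= w * x[t] with gradient descent on the MSE loss."""
    w = 0.0
    inputs, targets = sequence[:-1], sequence[1:]
    history = []
    for _ in range(epochs):
        preds = [w * x for x in inputs]
        history.append(mse(preds, targets))
        grad = sum(2 * (p - t) * x for p, t, x in zip(preds, targets, inputs)) / len(inputs)
        w -= lr * grad
    return w, history


# Hypothetical continuous "tokens" that decay by a factor of 0.9 each step.
sequence = [0.9 ** t for t in range(20)]
w, history = train_linear_next_token(sequence)
print(f"learned w = {w:.4f}, first loss = {history[0]:.4f}, last loss = {history[-1]:.2e}")
```

If the loss on a setup this simple fails to fall, the bug is in the data pipeline or the training loop, not in the loss you were about to add.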

---

2. Diffusion Models Haven’t Been Sitting Still Either

Autoregression is hot, but diffusion models haven’t been standing around.

RAEv2 is the enhanced version of RAE. It fixes three problems: reconstruction performance worse than a VAE, incompatibility with CFG, and using only the last layer’s features.

What’s the smartest point?

Discovering that RAE and REPA are complementary.

RAE uses the last layer of a vision encoder as its latent, which contains semantics. REPA uses the same encoder’s intermediate features to distill the diffusion model, improving spatial structure.
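The distillation side can be sketched as a patch-wise cosine alignment loss, which is how I understand REPA-style objectives to work. This is a simplified illustration: the projection head that maps diffusion hidden states into the encoder's feature space is omitted, and the vectors below are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(projected_states: list[list[float]],
                   encoder_features: list[list[float]]) -> float:
    """Negative mean patch-wise similarity: lower means better aligned."""
    sims = [cosine_similarity(h, y) for h, y in zip(projected_states, encoder_features)]
    return -sum(sims) / len(sims)


# Hypothetical two-patch example: one aligned patch, one orthogonal patch.
states = [[1.0, 0.0], [0.0, 1.0]]
features = [[2.0, 0.0], [3.0, 0.0]]
print(alignment_loss(states, features))
```

In training, a term like this would be added to the diffusion loss with a weighting coefficient, which is another hyperparameter worth ablating.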

I tested several backbones. Only DINOv3‑L managed to lift both sides simultaneously. ResNet50 didn’t work—reconstruction was okay, but generation was poor.

In practice, fusing K layers required ablation on K. K=2 gave the best reconstruction, K=4 gave the highest generation quality. I settled on K=3 as a compromise.

Another pitfall was AutoGuidance.

The original RAE couldn’t use standard CFG; it required training an additional weak diffusion model for guidance.

How do you define “weak”?

I tried cutting a third of the channels in the same framework, but the resulting model wasn’t weak enough, so auto‑guidance did nothing. I switched to halving the number of layers and finally saw an effect.
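Why "not weak enough" means "no effect" is easy to see from the guidance rule itself. The sketch below assumes the usual extrapolation form, where the guided prediction moves away from the weak model's output. The vectors are placeholders for denoiser outputs, and `autoguidance` is an illustrative helper name.

```python
def autoguidance(strong_pred: list[float], weak_pred: list[float],
                 scale: float) -> list[float]:
    """Extrapolate away from the weak model: weak + scale * (strong - weak)."""
    return [w + scale * (s - w) for s, w in zip(strong_pred, weak_pred)]


# Hypothetical predictions. When the weak model nearly matches the strong one,
# the difference term vanishes and the guidance scale has nothing to amplify.
strong = [1.0, 2.0]
print(autoguidance(strong, weak_pred=[1.0, 2.0], scale=3.0))  # weak == strong
print(autoguidance(strong, weak_pred=[0.5, 1.0], scale=3.0))  # clearly weaker model
```

A quick diagnostic before paying for a full weak-model training run is to measure how far the weak model's predictions actually sit from the strong model's on a held-out batch.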

This step is very GPU‑hungry. It’s like training two models.

As for DC‑AE 1.5, it focuses on “structuring” the latent space.

Although the source material says it speeds up diffusion model convergence, in my actual tests it made the training curve

Cael Lee