Vibe Coding AReaL (English)

Generated: 2026-06-23 02:26:46

---

Those Three Days, I Almost Threw veRL's Codebase Out the Window

Let me tell you a story that made my health bar hit zero.

I spent three whole days writing three hundred lines of configuration for the veRL framework, debugging an agentic RL training pipeline and being extremely careful every step of the way. Want to know what happened?

When the training curve finally came out, the reward was completely flat.

Flat. As in dead. The model didn't respond at all. I stared at that miserable curve, and cold sweat ran straight down my back.

What was my first reaction? Completely instinctual—I threw the entire veRL codebase at Claude and asked it to find the problem. Eighty thousand lines of code. I thought, AI is so powerful now, it'll definitely point me in the right direction.

And what happened? The AI went in circles across those eighty thousand lines, throwing out a dozen completely off-base hypotheses. One moment it was "maybe the reward function is wrong," the next it was "maybe FSDP's sharding strategy is incorrect." It kept going further and further off track. I went back and forth with it for a full two hours—zero problems solved, and I was the one going crazy.

At this point, you can probably guess what I'm about to say next.

20 Minutes of Enlightenment: Why Bigger Problems Need Smaller Solutions

Then I switched approaches.

I spent 20 minutes handwriting a super simple PPO training loop—no veRL, no Ray, no FSDP. Just pure Python: a few rounds of sampling, calculate advantage, update policy. A tiny script of only 80 lines.

Threw it at the AI.

It located the problem in 5 seconds—my reward scale was off by two whole orders of magnitude. The gradient signal was like a cry for help drowning in noise, never reaching the model.
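To make that failure mode concrete, here is a minimal sketch. It is not the 80-line script from the story. It assumes a toy two-armed bandit, a plain policy-gradient update with a batch-mean baseline standing in for full PPO clipping, and a reward that is 100x too small (the story only says "two orders of magnitude," not which direction). Every name and number here is made up for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(reward_scale, steps=200, batch=32, lr=0.5, seed=0):
    """Train a two-armed bandit policy; return the final probability of the better arm."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]          # policy parameters, one logit per arm
    true_reward = [0.2, 1.0]     # arm 1 is the better arm (toy values)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a batch of actions from the current policy.
        actions = [0 if rng.random() < probs[0] else 1 for _ in range(batch)]
        rewards = [reward_scale * true_reward[a] for a in actions]
        baseline = sum(rewards) / batch
        grad = [0.0, 0.0]
        for a, r in zip(actions, rewards):
            adv = r - baseline   # advantage relative to the batch mean
            for k in range(2):
                # d log pi(a) / d logit_k = 1[a == k] - probs[k]
                grad[k] += adv * ((1.0 if a == k else 0.0) - probs[k]) / batch
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)[1]


if __name__ == "__main__":
    print("healthy reward scale:", round(train(1.0), 3))
    print("reward 100x too small:", round(train(0.01), 3))
```

With the same seed, learning rate, and step count, the healthy run commits to the better arm while the shrunken-reward run barely moves off 50/50. In a script this small the cause is hard to miss; the same mistake buried in eighty thousand lines is not.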

See how counterintuitive this is? Everyone thinks "big problem needs big tools." But it's exactly the opposite—the more complex the system's bug, the more you need the smallest possible demo to crack it.

From that day on, my habits changed completely. When something goes wrong in a distributed training framework, my first instinct is no longer to throw the entire codebase at the AI. Instead, I squeeze out a minimal reproduction demo first. The very act of writing that demo eliminates 90% of the noise—you don't have to worry about Ray's worker assignment, SGLang's batch scheduling, or veRL's agent loop layer.

Simply put, a minimal demo helps two people at once: you and the AI. You clarify the problem's essence through the distillation process, and the AI doesn't have to guess "which part is relevant to this bug" among a mountain of irrelevant context.

I've validated this pattern at least ten times. Spending 20 minutes on a demo always costs less than letting the AI blindly guess for two hours in a huge codebase.

There's another bonus: many bugs that felt "absolutely necessary to fix" resolved themselves during the demo distillation process. Didn't even need the AI.

402 Parallel Sessions: A 32-Day "Multi-threaded" Experiment

Speaking of which, let me throw another number at you—402.

During that 32-day AReaL development cycle, I logged 402 multi-session parallel events. Even I was shocked.

Early on, when I started building the distributed RL framework, I made a classic mistake: having a single conversation with the AI, waiting for it to finish the plan, then waiting for it to finish the implementation, then waiting for it to do the review. Most of my time went to waiting, like standing in a long line.

Later I changed my approach. I ran three sessions simultaneously: one for planning, one for writing code, one for reviewing. With all three in parallel, throughput doubled.

But what's even more valuable than multitasking? Pass@k sampling.

For the same problem, I gave three sessions different constraints. For example, implementing a roll-out mechanism for the agent loop: one session followed veRL's original synchronous approach, one session built a fully async version, and one session tried SGLang's batch overlap idea. After they ran, I picked the best solution to merge.
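The "run several, keep the best" step can be sketched roughly like this. It is only an illustration: the three candidate functions are toy stand-ins for what three sessions might hand back, two of them are deliberately broken so the selection has something to do, and the `score` checks are hypothetical acceptance criteria. None of this says anything about which real rollout design is better.

```python
from concurrent.futures import ThreadPoolExecutor


def score(candidate, prompts, expected):
    """Count how many (hypothetical) acceptance checks a candidate passes."""
    out = candidate(prompts)
    checks = [len(out) == len(prompts), out == expected]
    return sum(checks)


def best_of_k(candidates, prompts, expected):
    """Evaluate every candidate in parallel; return the best name and all scores."""
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        futures = {
            name: pool.submit(score, fn, prompts, expected)
            for name, fn in candidates.items()
        }
        scores = {name: fut.result() for name, fut in futures.items()}
    return max(scores, key=scores.get), scores


# Toy stand-ins for three sessions' implementations (not real rollout code).
def sync_style(prompts):
    return [p.upper() for p in prompts]


def async_style(prompts):
    return [p.upper() for p in reversed(prompts)]  # deliberately loses ordering


def overlap_style(prompts):
    return [p.upper() for p in prompts[:-1]]       # deliberately drops the last item


CANDIDATES = {"sync": sync_style, "async": async_style, "overlap": overlap_style}

if __name__ == "__main__":
    best, scores = best_of_k(CANDIDATES, ["a", "b", "c"], ["A", "B", "C"])
    print(best, scores)
```

The point of the shape is that every candidate is judged by the same checks, so picking a winner is a comparison rather than a gut call.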

You might ask: aren't you afraid of conflicts? No. Git worktree gives each session its own isolated working directory, and Claude Code checks that a file has not changed since it was last read before editing it, raising an error on conflict. These mechanisms keep multiple sessions from trampling each other at the foundation.
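The idea behind that consistency check can be sketched in a few lines. To be clear, this is my own illustration of the general "refuse to write if the file changed since you read it" pattern, not Claude Code's actual implementation. `GuardedFile` and `StaleFileError` are hypothetical names, and comparing content hashes is just one assumed way to detect a change.

```python
import hashlib


class StaleFileError(RuntimeError):
    """Raised when a write is attempted on a file that changed since it was read."""


class GuardedFile:
    """Allow a write only if the file still matches what this session last read."""

    def __init__(self, path):
        self.path = path
        self._seen = None  # digest of the content as of the last read or write

    @staticmethod
    def _digest(data):
        return hashlib.sha256(data).hexdigest()

    def read(self):
        with open(self.path, "rb") as f:
            data = f.read()
        self._seen = self._digest(data)
        return data.decode("utf-8")

    def write(self, text):
        if self._seen is None:
            raise StaleFileError("read the file before writing to it")
        with open(self.path, "rb") as f:
            current = self._digest(f.read())
        if current != self._seen:
            raise StaleFileError("file changed since this session last read it")
        data = text.encode("utf-8")
        with open(self.path, "wb") as f:
            f.write(data)
        self._seen = self._digest(data)
```

If two sessions both read a file and the first one writes, the second one's write raises `StaleFileError` instead of silently overwriting the first change. It has to re-read before trying again.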

But! Multi-session parallelism has a particularly fatal blind spot. Think about it: what happens when all sessions get stuck on the same bug? They share the same false premise. No matter how many sessions you open, it's useless.

I fell into this trap with a tokenization performance optimization in veRL—three sessions all assumed "it's a data loader issue," and no one thought "maybe the token ID mapping is wrong." Wasted an entire day.

That drove home a profound realization: parallelism is not a silver bullet. Knowing when to parallelize for throughput and when to stop and resolve a blocking issue—that's the meta-decision requiring human intervention. Four sessions sprinting in the wrong direction is worse than one session pausing to think.

Three Layers of Verification: The Right Way to Avoid Potholes

veRL's CI has been running for nearly four months. The potholes I've stepped in could fill a booklet.

The verification system I finally settled on is actually very simple.

Layer one—pre-commit hooks. Auto-run formatting, linting, basic type checking. Honestly, this stops at most 25% of problems, usually low-level mistakes: a missing import, an undefined variable. But its real value? It enforces uniform formatting on AI-generated code, so subsequent reviews don't waste time on trivial stuff like indentation.

Layer two—my own /pr-review dynamic review. I use a "domain expert agent" that understands the full picture of veRL to do targeted reviews on changes. It goes beyond syntax; it looks at interface design, data flow, potential performance bottlenecks. This goes far deeper than generic linting. Think about it—AI-written code is usually syntactically correct; the traps are always at the logical level.

Layer three—the most painful layer: as the project grows, continuously adding tests for newly discovered failure modes. Other teams doing AI-assisted projects follow the same approach—when "a new feature breaks old code," immediately add a regression test, building a snowball effect of quality accumulation.
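As an example of what one of those "newly discovered failure mode" tests might look like, here is a sketch of a regression test for the reward-scale bug from earlier. The helper name `reward_scale_ok` and both thresholds are assumptions I picked for illustration; a real project would tune them to its own reward design.

```python
import statistics


def reward_scale_ok(rewards, min_std=0.05, max_abs=10.0):
    """Return True when rewards are neither collapsed toward a constant nor blown up.

    min_std and max_abs are illustrative thresholds, not recommended values.
    """
    if len(rewards) < 2:
        return False
    spread = statistics.pstdev(rewards)
    largest = max(abs(r) for r in rewards)
    return spread >= min_std and largest <= max_abs


def test_reward_scale_regression():
    healthy = [0.0, 1.0, 0.5, 1.0, 0.0]
    shrunk = [r * 0.01 for r in healthy]  # the 100x-too-small failure mode
    assert reward_scale_ok(healthy)
    assert not reward_scale_ok(shrunk)
```

A test runner such as pytest would pick up `test_reward_scale_regression` by its name. The test is tiny, but it means that particular pothole only gets stepped in once.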

Honestly, the gap between layer two and layer three is wide. Many teams only do layer one, then jump straight to "full test coverage"—but that's unrealistic. Tests for a distributed RL framework are too hard to write, with too many scenarios. I recommend investing more effort in tuning the review agent first; its ROI is much higher than writing tests.

Why "Writing Docs" Is the Coolest Job?

You know, I've written guided readings of veRL's code before; I knew it inside and out. But when it came to hands-on "zero-hand coding" development, I still hit a wall.

What was the problem?

Andrej Karpathy's "vibe coding" is essentially a "feel like it's right, ship it" approach. For one-off scripts or demos, that feels amazing—like making hotpot on the weekend, tossing in


Cael Lee
