Training an AI Model for 3 Yuan: 64M Parameters, Done in 2 Hours (Hands-On Test)

Generated: 2026-06-24 01:02:33

---

"Let me tell you something—I did something absolutely insane recently. I spent just three yuan, the price of a cheap cup of milk tea, and trained an AI model from scratch on my own computer that can actually chat with you!"

"Can you believe it?"

"The first time I saw this project link in a group chat, I had the same reaction as you: 'More clickbait, right?' I even took a screenshot, ready to call it out later. But then I pulled down the code, rented a cloud GPU, ran through the entire pipeline, and I was completely blown away."

"It really wasn't clickbait."

"It's just too counterintuitive."

"We've heard so many stories about 'training a large model costs millions of dollars' that when something like 'done for three yuan' pops up, our first instinct is: fake. But look, here I am talking to you about it, and it's real."

"Today, let's really dive into this amazing project called MiniMind. 64M parameters, a single RTX 3090, two hours to run, total cost: three yuan. I'll share all the pitfalls I hit and all the lessons I learned."

---

The First Real Question: Is This Thing Just Another Black Box?

"Let me ask you something first."

"A lot of people these days put 'proficient in large model training' on their resumes. But if you dig a little—how is the Attention matrix actually computed? Why is the FFN structured that way? What do you do if loss explodes halfway through training?—chances are they'll start hemming and hawing."

"Everyone's used to loading models with a single from_pretrained call from HuggingFace, writing a few lines of LoRA fine-tuning, and calling themselves experts. It's like following a recipe to boil instant noodles and then claiming you're a master chef."

"What excited me most about MiniMind is that it strips away that 'emperor's new clothes' illusion all by itself."

"The author wrote it by hand in plain PyTorch instead of leaning on high-level, off-the-shelf wrappers. The entire codebase is just a few thousand lines, with model architecture, training logic, and data processing all laid bare in front of you. Reading through it line by line is like disassembling a Transformer from start to finish."

"When I read model_minimind.py, my biggest realization was: so many blog posts out there try to explain 'Attention Is All You Need' in a roundabout way, but here it's just a few hundred lines of code. Seeing how Q, K, V are actually computed, how multi-head attention is assembled, how RoPE positional encoding is added—it's more useful than reading ten blog posts!"
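"To make that concrete, here is a minimal sketch of causal scaled dot-product attention for a single head, in pure Python. This is my own illustration, not code from model_minimind.py: the function names are hypothetical, RoPE is left out, and real implementations use batched tensor operations. The last helper shows one common way to share 4 KV heads among 8 query heads, which I am assuming matches the usual grouped-query layout."

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """q, k, v: lists of seq_len vectors (lists of floats) for ONE head."""
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for i, qi in enumerate(q):
        # Causal mask: position i only scores against positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(qi, k[j]))
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    """Assumed grouped-query mapping: consecutive query heads share a KV head."""
    return query_head // (n_heads // n_kv_heads)

if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0]]
    print(causal_attention(x, x, x))
    print([kv_head_for(h) for h in range(8)])
```

"Multi-head attention runs this once per head on different slices of the hidden vector and concatenates the results before the output projection."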

"I personally learned this the hard way: the first time I ran it, loss wouldn't drop at all. I wasted ages troubleshooting until I realized the learning rate and warmup steps weren't set right. If I'd been using a packaged library, I probably would have been left clueless. But because the code was bare, I could print out the gradients of each layer and quickly pinpoint the problem."
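"For anyone who hasn't set one up before, this is roughly what a warmup-plus-cosine learning rate schedule looks like. It's a generic sketch, not MiniMind's actual schedule: the function name and every default value here (peak_lr, warmup_steps, min_ratio) are assumptions I picked for illustration."

```python
import math

def lr_at(step, total_steps, peak_lr=5e-4, warmup_steps=100, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to peak_lr * min_ratio."""
    if step < warmup_steps:
        # Ramp up linearly so early, noisy gradients take small steps.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

if __name__ == "__main__":
    for s in (0, 50, 100, 500, 1000):
        print(s, lr_at(s, total_steps=1000))
```

"If the peak is too high or the warmup too short, a small model can stall right at the start, which is the kind of symptom I was chasing."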

"See? This isn't just a training tool—it's a living textbook. The author's message is clear: don't just use it, understand how it's made."

---

The Second Real Question: What Hardware Do I Need? Can My Ordinary Computer Handle It?

"This is the most frequently asked question. If the answer had to be 'you need an A100 or H100', then it wouldn't be relevant to most of us."

"Here are the real benchmarks I ran myself:"

RTX 3090 (24GB VRAM): Training from random initialization, pretraining + SFT fine-tuning, took about 2 hours and 10 minutes, VRAM usage around 15GB.
RTX 4060 (12GB VRAM): Also works; just lower the batch size a bit. Finished in about 3 hours.
CPU (my M2 MacBook Air): I tried it just to see if it would run. It runs, but slowly—took an entire weekend, something like 20+ hours.

"Here are the model specs for you: hidden dimension 768, 8 Transformer layers, 8 attention heads, 4 KV attention heads, vocabulary size 6400, total parameters 63.91M."
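"You can sanity-check that total with a back-of-the-envelope count. The sketch below assumes a LLaMA-style block: bias-free attention projections, a gated FFN with three weight matrices, two norm weights per layer, and an output head tied to the embedding. The FFN width of 2432 and the tied embedding are my assumptions, not numbers from the specs above; I picked that width because it makes the estimate land on the quoted total."

```python
def count_params(hidden=768, layers=8, heads=8, kv_heads=4, vocab=6400,
                 ffn_width=2432, tied_embeddings=True):
    """Rough parameter count for a decoder-only Transformer (assumed layout)."""
    head_dim = hidden // heads            # 96 with the specs above
    kv_dim = kv_heads * head_dim          # K and V are narrower than Q
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # Wq, Wo, Wk, Wv
    ffn = 3 * hidden * ffn_width          # gate, up, down projections (assumed)
    norms = 2 * hidden                    # one norm weight before attn, one before FFN
    per_layer = attn + ffn + norms
    embed = vocab * hidden
    head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return layers * per_layer + embed + head + final_norm

if __name__ == "__main__":
    total = count_params()
    print(total, round(total / 1e6, 2))
```

"Notice where the weight goes under these assumptions: the eight Transformer layers hold about 59M parameters, and the embedding table, thanks to the tiny 6400-token vocabulary, only about 4.9M."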

"At first I had this illusion: '64M is so tiny, any laptop should breeze through it!' But the reality is, even though the parameter count is only about 1/2,700th of GPT-3's 175B, you still have to load the model, feed data, and compute gradients, so VRAM still gets used. The barrier to entry is drastically lower, though: a single mid-range consumer GPU is enough!"
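"Here's a rough way to see where the memory goes. This sketch assumes fp32 training with an Adam-style optimizer, which keeps the weights, the gradients, and two moment buffers per parameter. Both of those are assumptions on my part, and the function name is made up for illustration."

```python
def static_training_bytes(n_params, bytes_per_value=4):
    """Weights + gradients + two Adam moment buffers, all at the same precision."""
    copies = 4  # weights, grads, first moment, second moment
    return n_params * bytes_per_value * copies

if __name__ == "__main__":
    n = 63_910_000  # the quoted 63.91M parameters
    print(static_training_bytes(n) / 1e9, "GB")
```

"Under those assumptions the fixed cost is only about 1GB. So most of the roughly 15GB I saw on the 3090 presumably comes from activations, which grow with batch size and sequence length. That would also explain why lowering the batch size is the first knob to turn on a smaller card."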

"Warning: The first time I ran it on a Mac, it kept crashing because the MPS backend doesn't fully support certain operations. Switching to CPU fixed it. If you're using Apple Silicon, I'd recommend sticking with CPU, or just rent a GPU from a cloud platform—it's three yuan, don't make life hard for yourself."

---

The Third Real Question: What's the Black Magic Behind 'Three Yuan + Two Hours'?

"It's not black magic at all. It's just straightforward cost accounting."

"Mainstream GPU cloud platforms in China, like Alibaba Cloud or AutoDL, charge around 1.5 yuan per hour for an RTX 3090. Two hours comes to about three yuan."
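"The accounting really is this simple. A tiny sketch, using the 1.5 yuan per hour figure as the default; the function name and the idle_hours parameter are just mine for illustration."

```python
def rental_cost(train_hours, idle_hours=0.0, yuan_per_hour=1.5):
    """Pay-per-hour cost: you are billed for idle time too."""
    return (train_hours + idle_hours) * yuan_per_hour

if __name__ == "__main__":
    print(rental_cost(2))                  # the headline two-hour run
    print(rental_cost(2 + 10 / 60))        # a run of 2 hours 10 minutes
    print(rental_cost(2, idle_hours=12))   # same run, instance left on afterwards
```

"The training itself is the cheap part. Hours where the instance sits idle are billed at the same rate."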

"A lot of people ask: 'How is that possible? Training large models takes months, not two hours!'"

"The key is the dataset size."

"MiniMind's pretraining phase uses a Chinese corpus of about 1.6GB, mainly from Wikipedia, books, and paper abstracts. For comparison, GPT-3 started from roughly 45TB of raw text before filtering, which is nearly 30,000 times larger. The SFT phase? A few tens of thousands of question-answer pairs, either manually annotated or distilled from a stronger model. This scale is perfect for a 64M model: enough data to learn from, but not enough to push training time into days."
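"If you want to check the 'nearly 30,000 times' figure yourself, here is the arithmetic as a quick sketch. It assumes decimal units (1TB = 1000GB), and the function name is mine."

```python
def size_ratio(big_tb, small_gb):
    """How many times larger big_tb terabytes is than small_gb gigabytes."""
    return (big_tb * 1000) / small_gb

if __name__ == "__main__":
    print(size_ratio(45, 1.6))
```

"That works out to a bit over 28,000, so 'nearly 30,000' is a fair rounding."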

"So you see, large model training is expensive because it's 'large'—large data, large parameters, large compute. MiniMind cuts all three, and the cost naturally plummets."

"But let me be clear about the cause and effect: it's not 'for three yuan you can train a model as good as GPT-4'; it's 'at extremely low cost, you can run through the entire large model training pipeline and end up with a usable miniature model'."

"So what's the real value here? Before, if you wanted to learn large model training, you either had to spend hundreds of thousands on hardware, or sit through a bunch of superficial 'Hi, welcome to the large model tutorial' courses. Now, for three yuan, you can build one with your own hands! Could you have imagined that two years ago?"

"Pitfall I ran into: The first time I rented a cloud server, I chose the pay-per-hour model and forgot to shut it down. It ran overnight, costing me dozens of extra yuan. Set an alarm, or just buy a pre-paid package to save yourself trouble."

---


Cael Lee
