How do you evaluate OpenAI's newly released GPT-5.5 model? - (English)

Generated: 2026-06-20 08:36:17

Have you ever read the story of Tian Ji's horse race?

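During the Warring States period, Tian Ji sacrificed his weakest horse against the king's best, then ran his best against the king's middle horse and his middle against the king's weakest, and won two races out of three.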

Today, OpenAI is using the same trick.

1. A Seeming Sweep

The official scorecard: coding 82.7%, finance 88.5%, and math up from 65.4% to 81.2%.

Doctoral-level reasoning came in at 85.6%, and scores on scientific charts also improved.

Any way you look at it, it's a big win.

But here's the thing: every one of these tests was chosen by OpenAI itself.

Claude never took the same exams.

"Tian Ji's horse race" says it all: whoever picks the matchups picks the result.

Speaking of which, here's something counterintuitive.

You think the biggest upgrade is the scores? Actually, it's not.

It's "reliability."

The variation across repeated runs is only 3.2%.

In other words, using the model is no longer a luck-of-the-draw affair.

Ask it ten times, and the answers are basically the same.

That's crucial.
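
If you want to check this kind of consistency yourself, here is a minimal sketch. It assumes "variation" means the gap between the best and worst score when the same benchmark is run several times; the announcement does not say how the 3.2% was computed, so the function name, the metric choice, and the sample scores below are all hypothetical.

```python
from statistics import mean, pstdev


def run_to_run_variation(scores):
    """Summarize per-run accuracy scores (in percent) from repeated runs.

    Returns (mean, spread, std), where spread is best-minus-worst in
    percentage points. This definition is an assumption, not the
    vendor's published method.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure variation")
    spread = max(scores) - min(scores)
    return mean(scores), spread, pstdev(scores)


# Hypothetical accuracies from running one benchmark five times.
example_runs = [80.0, 82.0, 81.0, 83.0, 79.0]
avg, spread, std = run_to_run_variation(example_runs)
print(f"mean={avg:.1f}%  spread={spread:.1f} pts  std={std:.2f} pts")
```

A small spread is what "ask it ten times, get basically the same answer" looks like in numbers. Reporting the standard deviation alongside it helps, because a single outlier run can inflate the best-minus-worst gap.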

Capability is continuous, but trust is not.

Once the error probability drops to a certain point, your behavior changes:

from "let it help me think" to "let it run first, and I'll review the output at the end."

That's what a real qualitative change looks like.
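
Why does a smooth drop in error rate produce a sudden change in behavior? A toy cost model makes the flip visible. Everything here is an assumption for illustration: the function names, the idea that each step fails independently, and every cost figure are hypothetical, not measurements of any model.

```python
def cost_supervised(steps, check_cost):
    """Cost of checking every step yourself ("let it help me think")."""
    return steps * check_cost


def cost_delegated(steps, step_error, review_cost, redo_cost):
    """Expected cost of letting the model run and reviewing once at the end.

    Assumes independent per-step errors, so the chance that at least one
    step goes wrong is 1 - (1 - step_error) ** steps.
    """
    p_fail = 1 - (1 - step_error) ** steps
    return review_cost + p_fail * redo_cost


def prefer_delegation(steps, step_error, check_cost, review_cost, redo_cost):
    """True when reviewing at the end is cheaper than checking each step."""
    delegated = cost_delegated(steps, step_error, review_cost, redo_cost)
    return delegated < cost_supervised(steps, check_cost)


# Hypothetical task: 20 steps, 1 unit to check a step,
# 5 units for a final review, 30 units to redo a failed run.
for step_error in (0.05, 0.03, 0.01):
    choice = prefer_delegation(20, step_error, 1, 5, 30)
    print(f"per-step error {step_error:.0%}: delegate = {choice}")
```

The per-step error rate moves gradually, but the decision is binary: above the crossover you supervise every step, below it you hand the whole task over. That is the sense in which capability is continuous and trust is not.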