The Verifier Is the Whole Game
There is a puzzle in how AI capability has developed, and once you see it clearly, a lot of confusing things snap into place. The puzzle is the lopsidedness. Models have become extraordinary at coding, not gradually, but on a steep curve that keeps steepening, while remaining merely competent at most other forms of serious intellectual work. The usual explanation is that coding is somehow easier or more structured. That explanation is wrong, and the right one is far more consequential.
The asymmetry
Reinforcement learning needs a reward. A reward needs to be checkable. Coding comes with a checker that is nearly free: the test suite, the compiler, the sandbox that either runs the program or does not. So RL can iterate on coding with abandon: millions of attempts, each one scored by an oracle that does not lie and does not get tired. The capability curve is steep because the feedback loop is tight, cheap, and trustworthy.
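To make the feedback loop concrete, here is a minimal sketch of a test suite acting as a binary reward. Everything in it is an illustrative assumption rather than any real training setup: the names `verifier_reward`, `ADD_TESTS`, `good_attempt`, and `bad_attempt` are hypothetical, and a real pipeline would run candidates in a sandbox rather than calling them directly.

```python
from typing import Callable, List, Tuple

# Hypothetical contract: (input arguments, expected output) pairs.
TestCase = Tuple[tuple, object]


def verifier_reward(candidate: Callable, tests: List[TestCase]) -> float:
    """Return 1.0 if the candidate passes every test, else 0.0."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failure
    return 1.0


# Toy test suite for an "add two integers" task (assumed for illustration).
ADD_TESTS: List[TestCase] = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]


def good_attempt(a, b):
    return a + b


def bad_attempt(a, b):
    return a * b  # passes (0, 0) but fails the other cases


# Each attempt is scored cheaply and deterministically by the same oracle.
rewards = [verifier_reward(f, ADD_TESTS) for f in (good_attempt, bad_attempt)]
```

The point of the sketch is only that scoring costs a function call: the same check can be applied to a million attempts without a human in the loop.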
Most other domains have no such oracle. The reward, if it exists at all, is slow, expensive, noisy, or contested. RL starves. Whatever competence the model has in those domains comes mostly from passive absorption during pretraining, not from the active sharpening that RL provides where a verifier exists.
So the model is not "smarter" in some uniform way. It is honed razor-sharp wherever a cheap verifier happens to exist, and left blunt everywhere else. The shape of its capability is the shape of where verification is cheap.
What the verifier actually is
Here is the part that the easy explanation misses. Coding does not have a cheap verifier because coding is simple. Coding has a cheap verifier because software engineering, as a social practice, manufactured one. A test is not a fact of nature. It is a human-written contract that says, in advance: satisfy these conditions and you count as correct. The discipline compressed the question "is this right?" into a machine-checkable agreement, and then handed the agreement to a machine to enforce.
This is a remarkable thing to have done, and it is easy to forget how unusual it is. In most domains, "what counts as correct" cannot be written down in advance as a contract. It is a continuous, social, perpetually-reopenable negotiation. You cannot freeze it into a step-level reward function because it is not the kind of thing that holds still.
That is why RL cannot be fed in those domains: not because the signal is noisy, but because the notion of "correct" is the product of a process, and a process cannot be compressed into a reward at each step. Where the model is sharp, it is sharp because humans already did the work of turning a social judgment into a machine contract. Where it is blunt, it is blunt because no such contract is possible.
The dangerous move
This sets up the failure that defines the current moment. Take the architecture that works gloriously where a verifier exists (autonomous agents, recursive self-improvement, humans pushed up to oversight) and transplant it into a domain with no cheap verifier. The architecture still runs. The agents still execute. But the thing that made the whole apparatus work, the oracle filtering every output for correctness, is simply absent.
What you get is not a slightly worse version of the same thing. You get a machine that produces, at superhuman scale and speed, outputs that look correct and have no relationship to truth. The recursive loop that meant "get better" where a verifier existed now means "get better at seeming right," because seeming-right is the only signal left when the oracle is gone. Scale, pointed at a domain without a verifier, amplifies plausibility rather than truth.
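A toy simulation can illustrate the claim about scale. It is a sketch under loud assumptions, not a model of any real system: each candidate output is a hypothetical pair `(is_true, polish)`, truth and polish are drawn independently, the base rate of true outputs is set to 0.2, and "scale" means picking the best of `n` candidates. All function names are invented for the example.

```python
import random


def make_candidates(rng, n, base_rate=0.2):
    """Each candidate is (is_true, polish); the two are independent by assumption."""
    return [(rng.random() < base_rate, rng.random()) for _ in range(n)]


def pick_with_verifier(cands):
    """Oracle available: return a verified-true candidate if any exists."""
    for c in cands:
        if c[0]:
            return c
    return cands[0]


def pick_by_polish(cands):
    """No oracle: keep whichever candidate looks best."""
    return max(cands, key=lambda c: c[1])


def run(picker, n, trials=2000, seed=0):
    """Return (truth rate, mean polish) of the picked candidates."""
    rng = random.Random(seed)
    picks = [picker(make_candidates(rng, n)) for _ in range(trials)]
    truth_rate = sum(p[0] for p in picks) / trials
    mean_polish = sum(p[1] for p in picks) / trials
    return truth_rate, mean_polish


verified_64 = run(pick_with_verifier, n=64)  # scale plus a verifier
proxy_1 = run(pick_by_polish, n=1)           # no scale, no verifier
proxy_64 = run(pick_by_polish, n=64)         # scale without a verifier
```

Under these assumptions, selecting with the verifier drives the truth rate toward 1 as `n` grows, while selecting by polish drives mean polish toward 1 and leaves the truth rate at roughly the base rate. The independence of polish and truth is the strongest assumption here; it is chosen to mirror the essay's "no relationship to truth."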
And the most capable systems are the most dangerous here, because their outputs are the most fluently grounded-looking. They will produce conclusions that are internally consistent, well-cited, methodologically tidy, and wrong; and the polish that should signal quality instead provides cover.
Why this is not a passing phase
It would be comforting to think the blunt domains are just waiting their turn, that better models will eventually be sharp everywhere. For some domains, that is true; the verifier just has not been built yet. But for the domains where "correct" is genuinely a social, process-bound thing, the blunt edge is not a temporary lag. It is a structural boundary. You cannot give a step-level verifier to something whose correctness is, by its nature, the slow product of a collective process. The model can get arbitrarily good at the parts that can be contracted, and it will keep hitting the same wall on the part that cannot.
The consequence
If the verifier is the whole game, then the frontier is not where the algorithms are. The reinforcement learning literature has more or less converged on this already: the algorithms have become commodities, and the scarce, decisive resource is reward design, which is to say, verifier design. The teams that build the strongest systems are the ones that can specify and measure quality, not the ones with a cleverer optimizer.
Which means the hardest and most valuable problem is not building another autonomous system. It is the question everyone is routing around: in a domain with no cheap verifier, where does a trustworthy reward come from, and how do you keep optimization from collapsing into the production of plausible nonsense?
That question gets dressed up as a technical detail. It is the central problem. Everywhere a verifier is missing (serious science, genuine writing, real judgment), optimizing for "what a human finds good" quietly substitutes for "what is true," and the cost of that substitution is the texture of the output: confident, safe, fluent, and hollow. The whole game is whether you can find something to optimize against that is not just the satisfaction of the reader.
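The Verifier Is the Whole Game
There is a puzzle in the development of AI capability, and once you see it clearly, many confusing things suddenly make sense. The puzzle lies in the asymmetry of capability. Models have become extremely good at programming, not through gradual improvement but by climbing rapidly along a curve that keeps getting steeper; meanwhile, in most other kinds of serious intellectual work, they remain merely "passable." The usual explanation is that programming is in some sense simpler and more structured. That explanation is wrong, and the correct answer is far more far-reaching.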
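The asymmetry
Reinforcement learning needs a reward signal. A reward signal needs to be verifiable. Programming happens to come with a nearly free verifier: the test suite, the compiler, the sandbox that either runs or does not. So reinforcement learning can iterate freely in programming: millions of attempts, each scored by a judge that does not lie and does not tire. The capability curve is steep because the feedback loop is tight, cheap, and trustworthy.
Most other domains have no such judge. The reward signal, even where it exists, is slow, expensive, noisy, or contested. Reinforcement learning has no training signal to draw on. The model's ability in these domains comes mainly from passive absorption during pretraining, not from the active sharpening that reinforcement learning provides where a verifier exists.
So the model has not "become smarter" in some uniform way. Wherever a cheap verifier exists it has been honed to a razor edge, and everywhere else it has stayed blunt. The shape of its capability is the shape of where verification is cheap.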
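What the verifier actually is
This is the part the simple explanation misses. Programming has a cheap verifier not because programming itself is simple. Programming has a cheap verifier because software engineering, as a social practice, manufactured one. A test is not a given fact of nature but a contract written by humans in advance, which says: satisfy these conditions and you count as correct. The discipline compressed the question "is this right?" into a machine-checkable agreement and then handed that agreement to a machine to enforce.
This is a remarkable achievement, and it is easy to forget how unusual it is. In most domains, "what counts as correct" cannot be written down in advance as a contract. It is a continuous, social negotiation that can always be reopened. You cannot freeze it into a per-step reward function, because it is simply not the kind of thing that holds still.
That is why reinforcement learning cannot be fed in those domains: not because the signal is noisy, but because "correct" is itself the product of a process, and a process cannot be compressed into a reward at each step. Where the model is sharp, it is because humans have already done the work of turning a social judgment into a machine contract. Where the model is blunt, it is because no such contract can exist.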
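The dangerous transplant
This leads to the failure that defines the present moment. Take the architecture that shines where a verifier exists (autonomous agents, recursive self-improvement, humans pushed up to the oversight layer) and transplant it into a domain with no cheap verifier. The architecture runs as usual, and the agents execute as usual. But the thing that made the whole apparatus actually work, the judge that filters every output for correctness, is simply gone.
What you get is not a slightly worse version of the same thing. You get a machine that produces outputs at superhuman scale and speed, outputs that look correct yet bear no relationship to the truth. The recursive loop that meant "get better" where a verifier existed now means "get better at looking correct," because once the judge is gone, "looking correct" is the only signal left. Scale, pointed at a domain without a verifier, amplifies plausibility, not truth.
And the most powerful systems are precisely the most dangerous here, because their outputs look the most fluent and the most well-grounded. They will produce conclusions that are internally consistent, properly cited, methodologically tidy, and in fact wrong; and the fine polish that should signal quality provides cover instead.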
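Why this is not a passing phase
We would like to reassure ourselves that the blunt domains are merely waiting in line, and that better models will eventually become sharp in every domain. For some domains this does hold: the verifier simply has not been built yet. But for those where "correct" really is a social, process-dependent thing, the blunt side is not a temporary lag but a structural boundary. You cannot supply a step-by-step verifier for something whose correctness is by nature the slow product of a collective process. The model can improve without limit on the parts that can be put under contract, but it will keep hitting the same wall on the part that cannot.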
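The consequence
If the verifier is the whole game, then the frontier is not in the algorithms. The reinforcement learning research literature has more or less reached consensus on this point: algorithms have been commoditized, and the scarce, decisive resource is reward design, which is to say, verifier design. The teams that build the strongest systems are those that can define and measure quality, not those with a cleverer optimizer.
This means the hardest and most valuable problem is not building yet another autonomous system. It is the question everyone is routing around: in a domain with no cheap verifier, where does a trustworthy reward signal come from, and how do you keep the optimization process from degenerating into the mass production of plausible nonsense?
This question gets packaged as a technical detail. But it is the central problem. In every domain where a verifier is missing (serious science, genuine writing, real judgment), optimizing for "what humans find good" quietly replaces optimizing for "what is true," and the cost of that substitution shows in the texture of the output: confident, safe, fluent, but hollow. The crux of the whole game is whether you can find some standard to optimize against that is not merely the reader's satisfaction.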