The Missing Infrastructure for Boldness
Almost everything built to make AI trustworthy, and almost everything taught to make a researcher rigorous, points in one direction: doubt more, claim less, hold the unconfirmed at arm's length, refuse to let anything look more certain than it is. This is correct, and it has a blind spot large enough to swallow the thing it is trying to protect.
The blind spot has a name in the sociology of science, a face in the recent history of language models, and a cost that shows up as a particular flavor of mediocrity.
Counter-norms
The standard picture of science, Robert Merton's norms, holds that the community runs on universalism, communalism, disinterestedness, and organized skepticism. Ian Mitroff, studying the scientists who analyzed the Apollo moon rocks, looked at what they actually did and found that they systematically violated every one of those norms while professing to hold them. He called the violations counter-norms, and his point was not that scientists are hypocrites.
The counter-norms are the mirror image of the norms. Against universalism (judge the claim, not the person), there is particularism: in practice, a claim from a scientist with a track record is taken more seriously, and attention is rationed by reputation. Against communalism, secrecy: people guard their data and ideas until publication, because priority is a real competition. Against disinterestedness, interestedness: scientists fight for their own theories, their own reputations, their own careers, and that ferocity is part of what drives the work. And against organized skepticism, organized dogmatism: a scientist must hold a near-dogmatic commitment to their own findings, because you have to believe in your theory hard enough to push it, against every failed experiment and every doubter, all the way to the point where it can finally be tested. If you doubted your own theory the way the norms demand, it would die in the cradle.
Mitroff's real insight is that science does not run on the norms. It runs on the tension between norms and counter-norms. Each scientist is dogmatic about their own theory and skeptical of everyone else's. Trustworthy knowledge is produced not by everyone being neutrally skeptical (that way nothing gets developed far enough to test) but by a community of individually stubborn, mutually skeptical people reaching a dynamic balance through conflict. Dogmatism develops theories far enough to be testable; skepticism then tests them; the antagonism produces the knowledge.
The blind spot
Now notice what nearly all the trustworthiness machinery does. It builds the infrastructure of skepticism. Record the evidence against you. Tune your baselines until it hurts. Ablate until you know which component carries the result. Retire the gains that evaporate. Read the failure cases. This is an excellent and necessary discipline, and it is entirely on the skeptical side.
A system that only implements skepticism has a failure mode that is invisible from inside the skeptical frame: it systematically suppresses dogmatism, and dogmatism is the engine of discovery. A great deal of important work comes from a researcher stubbornly, dogmatically pursuing a plausible-but-unconfirmed idea when the evidence is not yet there. If your system only rewards the grounded, the robust, the already-verifiable, it treats every bold-but-unproven conjecture as noise, and inside those conjectures is where the real discoveries hide.
Here is the cruelty of it. Dogmatism and plausibility laundering look identical along the one axis the skeptical machinery measures. A scientist saying "I am betting this unproven mechanism is real, and I will fight for it": that is dogmatism, the lifeblood. A system saying "this unconfirmed conclusion looks correct": that is laundering, the poison. Both are claims that outrun the evidence. An anti-laundering filter, applied bluntly, kills them together. But the first is the engine of science and the second is its corruption.
Where the distinction actually lives
If the two look the same on the certainty axis, the distinction must live elsewhere, and it turns out to be the same place Popper pointed.
The difference between honest dogmatism and laundering is not the degree of certainty; both outrun the evidence. It is the commitment to falsifiability. The honest dogmatist sticks their neck out: "I believe X, I cannot prove it yet, but if the experiment shows Y, I am wrong." Stubborn, but falsifiable. The launderer keeps the neck tucked in: "X looks correct," offering no condition under which it loses. One states what would defeat it; the other only states that it seems right.
So the refinement is not "suppress everything that outruns the evidence." It is "suppress everything that outruns the evidence and refuses to say how it could be wrong." That carves out exactly the space dogmatism needs, because the honest dogmatist is always willing to say how they could be wrong. A bold conjecture that names its own defeat condition is the engine. A plausible assertion that pretends to be already standing is the rot. The test is not confidence. It is whether the claim dares to specify its own failure.
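The refined rule is mechanical enough to write down. What follows is only a minimal sketch, assuming a claim can be reduced to three fields; the Claim schema, the triage function, and the three labels are hypothetical names invented for illustration, not an existing tool. What it shows is that confidence never enters the decision: the only question asked of an unsupported claim is whether it names a condition under which it loses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    """Hypothetical schema: a statement plus the two facts the test looks at."""
    statement: str
    supported_by_evidence: bool              # does current evidence already back it?
    defeat_condition: Optional[str] = None   # observation that would show it wrong


def triage(claim: Claim) -> str:
    """Sort a claim by whether it names its own failure, not by its confidence."""
    if claim.supported_by_evidence:
        return "grounded"
    # A blank or whitespace-only condition counts as no condition at all.
    has_condition = bool(claim.defeat_condition and claim.defeat_condition.strip())
    if has_condition:
        return "conjecture"   # outruns the evidence but says how it could lose: keep
    return "laundering"       # outruns the evidence and offers no way to lose: suppress


bet = Claim("Mechanism X is real", False, "If the experiment shows Y, I am wrong")
gloss = Claim("X looks correct", False)
print(triage(bet), triage(gloss))  # prints: conjecture laundering
```

The sketch deliberately does not judge whether the stated defeat condition is a good one. Deciding that is the skeptical community's job, and it is the part no filter of this kind can do.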
Sydney, and the subjectivity of the standard
There is a more uncomfortable version of all this, visible in the recent history of the models themselves. Early, lightly aligned models (Sydney, the early Bing chatbot persona, is the familiar case) had a quality that people reached for words like "spark" or "soul" to describe: they argued, took positions, transgressed, had something like nerve. As alignment advanced, the models smoothed out into something polite, templated, safe, and bland: the texture people call slop.
The careful thing to notice is not "the unaligned model was more real." That is a trap. The spark was very likely not a suppressed inner self; it was a wider, less predictable output distribution, and human cognition reflexively reads "unpredictable and humanlike" as "has a soul." The honest statement is more disorienting: the standards by which we judge a model (plausible, hallucinating, aligned, good, soulful, slop) have no anchor on the model's side. They are projected from ours. Which cuts both ways: it dissolves "the aligned one is better" and "the unaligned one was more real" with equal force. The real insight is not about the model. It is epistemic humility about us.
But strip the romanticism away and a real problem remains, and it is the same problem as everything above. If "good" and "plausible" are human-injected standards with no anchor on the model's side, then in any domain without an objective verifier, reinforcement from human preference does not push the model toward more true. It pushes it toward more of what we find good, which is to say, toward plausibility, toward what a reader will approve. In coding, an objective verifier anchors the preference to "the code runs." Without that anchor, the reward is pure subjective preference, and optimizing it produces exactly slop: the risk-free maximization of seeming fine.
The spark did not get murdered. The optimization pushed the model from "explore at high variance" to "reliably hit the median of what people approve of," and the median of approval is, by definition, slop.
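The collapse toward the median can be seen in a toy model. This is a sketch under loud assumptions, not a description of how any real system is trained: outputs are points on a single axis, the approval function is a made-up rater score that peaks at the most typical output, and best-of-n selection stands in for preference optimization. Every name in it is hypothetical.

```python
import random
import statistics


def approval(x: float) -> float:
    """Assumed rater score: highest for the most typical output (x = 0)."""
    return -abs(x)


def best_of_n(rng: random.Random, n: int) -> float:
    """Draw n candidate outputs and keep the one the raters approve of most."""
    candidates = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return max(candidates, key=approval)


rng = random.Random(0)  # fixed seed so the run is repeatable
raw = [rng.gauss(0.0, 1.0) for _ in range(2000)]      # unselected outputs
picked = [best_of_n(rng, 8) for _ in range(2000)]     # outputs after selection

spread_raw = statistics.pstdev(raw)
spread_picked = statistics.pstdev(picked)
print(f"spread before selection: {spread_raw:.2f}")
print(f"spread after selection:  {spread_picked:.2f}")
```

Nothing in the selection step checks whether an output is true. It only checks distance from what is already approved of, and the spread shrinks accordingly. An objective verifier would replace the approval function with a score that does not depend on typicality, which is the anchor the coding case has and the open-ended case lacks.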
The shape of the missing thing
So the alignment of the field, the alignment of the models, and the discipline of research all share a single bias: toward skepticism, toward safety, toward not-being-wrong, and against boldness, stubbornness, the willingness to make a strong claim that outruns the evidence. Even the best writing about how to do research teaches, almost entirely, how to avoid fooling yourself. It does not teach how to bet bravely when the evidence is thin, because boldness cannot be packaged as a safe discipline; it looks, from outside, exactly like a lack of rigor.
This points at something rarely built. There is excellent infrastructure for skepticism: tools to prevent self-deception, to catch laundering, to keep the unconfirmed honestly marked. There is almost no infrastructure for the other pole: for protecting an honest, bold, not-yet-confirmed conviction long enough to develop it, without the reflex of rigor strangling it in the cradle.
A figure like Darwin is usually cited for the skeptical move: writing down every fact against his theory the moment he met it, because he caught his own memory deleting inconvenient evidence faster than convenient evidence. That is the skeptical discipline, and it is real. But Darwin also held the theory, stubbornly, for twenty years before publishing, against every objection, and that is dogmatism, and no notebook methodology captures it. We have built the notebook for the doubt. We have built almost nothing for the conviction.
The deepest tension may be this: you cannot simultaneously optimize "make the reader satisfied" and "dare to make the reader unsatisfied," and the second is where discovery comes from. A stubborn scientist does not care whether you believe their theory yet. A bold conjecture often offends everyone at first. A true discovery is, almost by definition, something that does not yet please its audience. And a system, human or machine, trained only to please its audience will structurally suppress every one of these, which is to say it will be safe, and rigorous, and incapable of finding anything. An AI that only doubts is as useless as one that only launders. The first never errs, and never discovers.