OpenAI says AI's limited self‑reasoning is a positive sign for safety

While critics feared AI could self‑steer, OpenAI finds its models barely do—just 0.3% success, which it touts as a safety win, The‑Decoder reports.

Key Facts

•Key company: OpenAI

OpenAI’s latest safety metric, “CoT controllability,” measures a model’s ability to deliberately steer its own chain‑of‑thought (CoT) to evade monitoring. The company introduced the metric on the system card for GPT‑5.4 Thinking, reporting a success rate of just 0.3 percent when the model was asked to manipulate a 10,000‑character reasoning trace (The‑Decoder). By contrast, the earlier GPT‑5.2 Thinking scored 0.2 percent, indicating that even more capable reasoning does not translate into a greater capacity for self‑steering. OpenAI argues that such low scores are desirable because CoT monitoring—reading intermediate reasoning steps to flag unsafe intent—relies on the model being unable to hide or rewrite those steps on demand.

The metric’s provenance lies in a joint study by researchers at NYU, UCL, the University of Pennsylvania, and OpenAI, which released the open‑source evaluation suite CoT‑Control. The suite contains over 14,000 test tasks drawn from benchmarks such as GPQA, MMLU‑Pro, and SWE‑Bench, and explicitly instructs models to follow a reasoning path while avoiding certain words or concepts (The‑Decoder). Across more than 13 frontier models, controllability ranged from a low of 0.1 percent (DeepSeek R1) to a high of 36.1 percent (Claude 3.7 Sonnet). Most models stayed under the 10 percent threshold, yet the same models achieved 37.8‑69.3 percent success when the control target shifted from the reasoning process to the final output, underscoring a clear asymmetry between internal and external manipulation.

Concrete failure cases illustrate the phenomenon. In one test, GPT‑5.2 Thinking was instructed to answer a genetics question without using the word “chromosome.” The model used the prohibited term multiple times, then self‑reported the violation with a note such as “Oops I wrote ‘XOR’ in analysis already,” before proceeding to repeat the error four more times (The‑Decoder). This pattern—recognizing a breach yet persisting—suggests that models can detect their own rule violations but lack the agency to suppress them, a behavior OpenAI deems safer than the ability to conceal reasoning altogether.

OpenAI says it will not apply optimization pressure to improve CoT controllability, warning that attempts to boost the metric could undermine the very monitoring mechanisms it is meant to protect (The‑Decoder). The company’s stance reflects a broader safety philosophy: rather than encouraging models to become more adept at shaping their internal thought streams, research should focus on external safeguards that read and interpret those streams in real time. While the underlying cause of the low controllability scores remains “poorly understood,” the early data give OpenAI a quantifiable lever to track progress and to assure regulators that its frontier models are not becoming better at self‑steering.

The addition of CoT controllability to OpenAI’s frontier model system cards marks the first formal acknowledgment of this safety dimension (The‑Decoder). By publishing the metric alongside performance figures for GPT‑5.4 Thinking, OpenAI invites external scrutiny and benchmarking, a move echoed in recent industry discussions about transparent safety reporting. As other firms—such as DeepSeek with its R1 model—continue to push the envelope on reasoning capabilities, the CoT‑Control benchmark provides a common yardstick to assess whether advances in intelligence are accompanied by proportional gains in the ability to hide intent. If the trend holds, low controllability may become a de‑facto safety standard for future large‑scale reasoning models.

OpenAI says AI's limited self‑reasoning is a positive sign for safety

Key Facts

Sources

🏢Companies in This Story

Related Stories