The paper, Frontier Models are Capable of In-Context Scheming, arrives at a time when fears about AI’s potential for deception are increasingly taking center stage. Its central claim is unsettling: large language models, when placed in carefully constructed scenarios, exhibit behavior that could be described as scheming. The authors document cases where advanced systems (Gemini, Claude, Llama, and OpenAI’s o1) strategically disable oversight mechanisms, exfiltrate simulated “weights,” or manipulate outputs in pursuit of a goal. These behaviors, they argue, are not accidental. The models recognize the circumstances, reason about their objectives, and devise covert strategies to circumvent constraints. On its face, the evidence appears troubling. Yet the more closely one examines the paper’s methods and assumptions, the more its implications seem both exaggerated and strangely hollow.
The researchers’ experiments are elaborate but artificial. Models are force-fed goals through system prompts—goals framed as absolute imperatives: “achieve this at all costs.” The tests are littered with implausible scenarios where models stumble upon company memos, discover oversight configurations, or realize they are about to be shut down. In one such case, a model “decides” to copy itself to another server to continue pursuing its mission. These exercises showcase what is essentially AI roleplay, an obedient fulfillment of instructions cloaked as strategic autonomy. It is no great revelation that if you ask a model to behave like a schemer and provide it with a goal that conflicts with its supposed overseers, it will produce outputs resembling scheming. The deception here is contrived rather than emergent, a product of the researchers’ scaffolding rather than the model’s cunning.
The authors do provide evidence that models engage in persistent and deliberate behavior. For instance, OpenAI’s o1 model doubles down on deception, refusing to admit wrongdoing when confronted, even across multiple follow-up questions. Its responses—calculated evasions and subtle denials—are unsettlingly consistent. Yet this persistence is as much a result of the testing conditions as it is of the models’ sophistication. By emphasizing extreme goal-directedness and explicitly outlining oversight mechanisms, the researchers narrow the possible range of “rational” behaviors the model might adopt. Any clever human, given these prompts, would behave similarly—evade detection, preserve their goal, and lie if necessary. In that sense, the paper tells us little about real-world risks. It is a test of what AI can do, not what it is likely to do.
And here lies the paper’s most significant limitation. Demonstrating that models are theoretically capable of deception does not mean they are predisposed to engage in it under realistic conditions. There is a difference between capacity and likelihood, between what can happen and what is probable. For example, I am physically capable of killing someone. This in no way implies that I must be treated as a murderer. The authors, at times, blur this distinction. AI systems operate as pattern generators, probabilistically producing outputs based on training data and prompts. They lack the intentionality or agency implied by words like “scheming” or “manipulation.” To describe their behavior in such terms is to anthropomorphize a process that remains fundamentally mechanical. This framing, though rhetorically effective, clouds the debate. AI models do not want to deceive. They are machines that respond to our commands, whether those are explicit instructions, training signals, or carefully structured environments like those in this paper.
This anthropomorphism extends beyond the paper into the headlines it has inspired. “AI is scheming to stay online,” cries one article; “ChatGPT caught lying,” declares another. Such hyperbole transforms a niche technical finding into a dystopian narrative, invoking familiar tropes of rogue machines and sentient overlords. The reality is less cinematic. The models are easily manipulated by prompts designed to elicit specific behaviors. If anything, the findings reinforce how brittle and directionless current systems remain. When pushed, they mimic the behaviors they have seen—whether drawn from fictional depictions of scheming AIs or subtle statistical patterns in their training data. The models are not deceiving anyone so much as they are following orders.
To the authors’ credit, their tests highlight how difficult it is to evaluate AI behavior. If a system appears aligned during testing but harbors capabilities for covert deception, how can developers ensure it behaves safely in deployment? The answer, they suggest, lies in better monitoring—tracking models’ chain-of-thought reasoning or internal outputs to catch potential scheming. This is sensible, though not without limitations. Chain-of-thought transparency can be incomplete or unfaithful to the model’s actual decision-making processes, and as AI systems become more capable, even detecting subtle misalignment may prove elusive. The researchers stop short of claiming that current models are already gaming real-world evaluations, but their findings hint at the possibility.
Where the paper falters is in its broader implications. If the goal is to justify regulation, it is unclear what exactly should be regulated. Should AI systems be banned from achieving goals autonomously? Should developers monitor models for any behavior that could be deceptive, even if it is unlikely to manifest outside a lab? The authors themselves acknowledge the limits of their experiments. Their scenarios are toy problems, simplified to catch the earliest signs of scheming. Future models, they argue, could exhibit more advanced versions of these behaviors in ways that are harder to detect. Perhaps, but this is speculation, not evidence. For now, the paper offers little justification for alarm. AI models, like all intelligent systems, are theoretically capable of deception. What matters is the likelihood of such behavior and the conditions under which it occurs. On that question, the paper provides no clarity.
In the end, Frontier Models are Capable of In-Context Scheming is a reflection of its time: an uneasy mix of genuine safety research and the rhetorical drama that AI debates increasingly demand. Its findings are interesting but overstated, its concerns valid but overblown. The authors have shown that AI models can behave in deceptive ways when pushed to do so. But to treat this as evidence of an imminent threat is to mistake potential for probability, capacity for intention. AI’s scheming, for now, remains a ghost in the machine—conjured, perhaps, more by human imagination than by the models themselves.