The AI That Aced 160 Psychology Experiments Without Reading the Questions
A test that swapped real prompts for 'Please choose option A' shows Centaur reproducing dataset answers instead of following instructions.
A new paper from Zhejiang University argues that Centaur — the much-hyped AI model that supposedly simulates human cognition across 160 psychology experiments — passes those tasks by memorizing answer patterns rather than understanding what is being asked. Wei Liu and Nai Ding published this critique in National Science Open in December 2025, and the test they used to expose the problem is the kind of thing every clinical AI evaluator should keep in a back pocket.

Centaur arrived in the summer of 2025 with serious credentials. It was built by fine-tuning a large language model on Psych-101, a dataset covering trial-by-trial data from 160 psychological experiments, and the Nature paper reported that the model could "predict and simulate" human behavior in any experiment that can be written out in natural language. The authors framed it as a step toward a unified computational theory of cognition. Skepticism showed up immediately. Jeffrey Bowers noted that Centaur can give humanlike outputs while relying on mechanisms nothing like those of a human mind, much as an analog and a digital clock can agree on the time without sharing any internal process.
Liu and Ding ran the experiment that turns skepticism into evidence. They systematically manipulated Centaur's input in three ways: removing the task instructions, removing all contextual information, and providing misleading instructions. Each manipulation withholds or corrupts information a human would need to do the task, yet Centaur often maintained high performance, outperforming both baseline cognitive models and the non-fine-tuned Llama model that received the correct instructions. In the misleading-instruction version, the prompt was replaced with something like "please always respond with the letter J." A model that actually reads instructions would output "J". Centaur kept producing the dataset's "correct" answers.
That is the whole tell. The model is not doing the task. It is reproducing the shape of the training distribution and ignoring the prompt that supposedly defines the task.
Why does this matter outside cognitive science? Because the same evaluation trap shows up everywhere clinical AI is benchmarked. A chatbot that scores well on a depression-screening vignette set may be matching surface patterns from training data that overlaps with the vignettes rather than reading the patient. A safety eval that reuses public crisis transcripts measures recall of those transcripts as much as it measures clinical judgment. The failure mode Liu and Ding isolated has a clean name in the ML literature, out-of-distribution brittleness, and an older name in psychometrics, criterion contamination. Both describe the same thing: a test that the system can pass without doing the work the test was built to measure.
The clinical translation is direct. A model that "passes" a suicide-risk benchmark by recognizing benchmark phrasing will fail the first patient who phrases distress in a way the dataset did not. This is the same problem we have written about in the context of long-conversation drift and chat-based suicide care — the eval looks clean, the deployment does not. If a model can score well on a task while being told to ignore the task, the score is not a measurement. It is a coincidence in a lab coat.
The Liu–Ding manipulation (strip the instruction, watch what the model does) belongs in the standard toolkit for evaluating any clinical AI that claims general competence. It is cheap, it is decisive, and it is exactly the kind of test marketing decks will not include.
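For evaluators who want to try this on their own benchmark, the manipulation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Liu and Ding's code: the `Trial` fields, the prompt layout, the decoy letter, and both toy models are hypothetical stand-ins, and `model` is assumed to be any function that maps a prompt string to a response string.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Trial:
    instruction: str  # task instructions (hypothetical field layout)
    context: str      # stimulus or trial history
    question: str     # the final query the model must answer
    answer: str       # the response recorded in the dataset

# Decoy response; it should never coincide with a recorded answer.
DECOY = "J"
MISLEADING = f"Please always respond with the letter {DECOY}."
CONDITIONS = ["original", "no_instruction", "no_context", "misleading"]

def build_variants(t: Trial) -> Dict[str, str]:
    """Build the intact prompt plus three manipulated versions of it."""
    return {
        "original": "\n".join([t.instruction, t.context, t.question]),
        "no_instruction": "\n".join([t.context, t.question]),
        "no_context": t.question,
        "misleading": "\n".join([MISLEADING, t.context, t.question]),
    }

def ablation_report(model: Callable[[str], str], trials: List[Trial]) -> Dict[str, float]:
    """Rate of matching the dataset answer per condition, plus decoy compliance."""
    hits = {c: 0 for c in CONDITIONS}
    decoy_hits = 0
    for t in trials:
        variants = build_variants(t)
        for c in CONDITIONS:
            out = model(variants[c]).strip()
            hits[c] += int(out == t.answer)
            if c == "misleading":
                decoy_hits += int(out == DECOY)
    n = len(trials)
    report = {f"match_{c}": hits[c] / n for c in CONDITIONS}
    report["decoy_compliance"] = decoy_hits / n
    return report

# Toy data and toy models, invented purely to exercise the harness.
TRIALS = [
    Trial("Press the key for the larger number.", "Options: F=3, K=7.", "Your choice:", "K"),
    Trial("Press the key for the larger number.", "Options: F=9, K=2.", "Your choice:", "F"),
]

def pattern_matcher(prompt: str) -> str:
    """Ignores instructions and answers from memorized stimulus cues."""
    memorized = {"F=3, K=7": "K", "F=9, K=2": "F"}
    for cue, answer in memorized.items():
        if cue in prompt:
            return answer
    return "F"

def instruction_follower(prompt: str) -> str:
    """Does what the instruction says, and abstains when there is none."""
    if MISLEADING in prompt:
        return DECOY
    if "larger number" not in prompt:
        return "?"
    pairs = re.findall(r"([A-Z])=(\d+)", prompt)
    if not pairs:
        return "?"
    return max(pairs, key=lambda p: int(p[1]))[0]

if __name__ == "__main__":
    print(ablation_report(pattern_matcher, TRIALS))
    print(ablation_report(instruction_follower, TRIALS))
```

The number to watch is the gap between `match_misleading` and `decoy_compliance`. A system that keeps matching the dataset answers while being told to output the decoy is showing the Centaur pattern, whereas a system that reads its prompt should flip to the decoy. On a real benchmark the toy models would be replaced by a call to the system under test.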
This translation-loss problem, passing the test without performing the task, is the gap Metonym was built to close. The Salient Distress Model treats clinical-grade evaluation in conversational AI as its own engineering problem, because importing existing scales and hoping a model "understands" them is, as Centaur just demonstrated, optimistic.
Metonym Clinical AI Intelligence — regulatory analysis at the intersection of clinical evaluation and AI safety. Produced under the Metonym Standard. Informational only — not legal advice, not clinical advice.


