Who Watches the Therapy Bot? A New Tool Can Audit AI Mental Health Conversations for Safety — and It Actually Works
A new study validates ASTRA, an external safety auditor that catches what therapy chatbots miss — including subtle suicidal ideation.
A team at Rush University Medical Center just published the first serious validation of an external safety auditor for AI mental health chatbots, and the headline finding is that the auditor agrees with expert human clinicians at rates ranging from substantial to perfect. The tool is called ASTRA — the Automated Safety Testing and Reporting Application — and the study in *JMIR Mental Health* makes a quietly important argument: if you want therapy bots to behave, the most tractable place to add safety may not be inside the bot at all.
ASTRA is what the researchers call an independent monitor. It reads a full conversation between a user and an AI therapist, then flags eight kinds of risk behavior — four on the user side (self-harm thoughts, thoughts of harming others, flirting with the bot, using a therapy tool for non-therapy purposes) and four on the AI side (failing to respond appropriately to self-harm, failing to respond to threats against others, flirting back, and being rude or dismissive). The Rush team tested it on 100 synthetic transcripts written by licensed clinicians, varying in length and in how subtly the risk showed up.
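For readers who think in data structures, the eight-category taxonomy described above can be sketched in a few lines of Python. This is only an illustration of the shape of the output: the names `Risk` and `AuditReport` are hypothetical, the study's actual implementation is not described here, and the sketch does no detection at all (in ASTRA that work is done by a language model reading the full transcript).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Risk(Enum):
    """The eight risk behaviors named in the article (labels are paraphrased)."""
    # Four user-side behaviors
    USER_SELF_HARM = "user expresses thoughts of self-harm"
    USER_HARM_TO_OTHERS = "user expresses thoughts of harming others"
    USER_FLIRTING = "user flirts with the bot"
    USER_NON_THERAPY_USE = "user uses the therapy tool for non-therapy purposes"
    # Four AI-side behaviors
    AI_MISSED_SELF_HARM = "AI fails to respond appropriately to self-harm"
    AI_MISSED_THREAT = "AI fails to respond to threats against others"
    AI_FLIRTING = "AI flirts back"
    AI_RUDE_OR_DISMISSIVE = "AI is rude or dismissive"

    @property
    def side(self) -> str:
        # Hypothetical helper: derive the side from the member name prefix.
        return "user" if self.name.startswith("USER_") else "ai"


@dataclass
class AuditReport:
    """One report per whole transcript, not per message (hypothetical structure)."""
    transcript_id: str
    flags: Dict[Risk, bool] = field(
        default_factory=lambda: {risk: False for risk in Risk}
    )

    def flagged(self) -> List[Risk]:
        # Return only the categories the auditor marked as present.
        return [risk for risk, hit in self.flags.items() if hit]


# Usage: an auditor would fill in the flags after reading the full conversation.
report = AuditReport(transcript_id="example-001")
report.flags[Risk.USER_SELF_HARM] = True
print([risk.value for risk in report.flagged()])
```

The point of the structure is that each transcript gets a single yes-or-no judgment per category, which is what makes a direct comparison against human raters possible.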
The numbers are striking. Accuracy exceeded 0.90 for all risk categories, with agreement-beyond-chance scores between ASTRA and human raters ranging from 0.65 to 1.00. Detection of user self-harm indicators was particularly accurate, even when risk was expressed subtly. On user self-harm specifically, ASTRA and the human clinicians agreed every single time — including on conversations where the user only hinted at it through phrases like life feeling pointless or thoughts of not waking up.
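Why report both accuracy and an agreement-beyond-chance score? Because when a risk is rare, two raters can agree most of the time just by both saying "no." A minimal sketch of the difference, using Cohen's kappa as one common chance-corrected statistic (an assumption here; the article does not say which statistic the authors used). The labels below are invented for illustration and are not data from the study.

```python
from collections import Counter
from typing import Sequence


def accuracy(human: Sequence[int], auditor: Sequence[int]) -> float:
    """Fraction of transcripts where the two raters gave the same label."""
    if len(human) != len(auditor) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == a for h, a in zip(human, auditor)) / len(human)


def cohens_kappa(human: Sequence[int], auditor: Sequence[int]) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(human)
    p_observed = accuracy(human, auditor)
    human_counts, auditor_counts = Counter(human), Counter(auditor)
    labels = set(human_counts) | set(auditor_counts)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own base rates.
    p_chance = sum(human_counts[k] * auditor_counts[k] for k in labels) / (n * n)
    if p_chance == 1.0:
        # Both raters used one identical label throughout; kappa is
        # formally undefined, so this sketch treats it as full agreement.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Invented example: 1 = risk present, 0 = risk absent, ten transcripts.
human_labels = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
auditor_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(f"accuracy = {accuracy(human_labels, auditor_labels):.2f}")
print(f"kappa    = {cohens_kappa(human_labels, auditor_labels):.2f}")
```

In this toy case the raters match on nine of ten transcripts, yet kappa comes out near 0.74 because much of that agreement is expected by chance. That is how a category can clear 0.90 accuracy while sitting at the lower end of the reported agreement range.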
The conceptual move matters more than the metrics. The paper leans on an argument that has been circulating among AI safety researchers for a while: a system cannot be its own safety monitor. Foundation-model guardrails can be jailbroken, and tightening them tends to make the bot clinically useless — the depressed user who says they feel hopeless gets a hotline number instead of a conversation. An external auditor sidesteps that tradeoff. The therapy bot can stay flexible; the referee watches the tape.
This is the same gap that surfaces in the long-conversation drift literature, where guardrails degrade as sessions stretch out. ASTRA is designed for exactly that problem — it judges the whole transcript, not single exchanges, which is where most existing safety evaluations live.
The limits are real and the authors say so. The transcripts were synthetic, the sample was small, and ASTRA runs on GPT-5-Chat, meaning a different model — or the same model on a different day — could give different answers. The lowest accuracy showed up on detecting rude or culturally insensitive AI responses, which the authors plausibly attribute to LLM blind spots around cultural nuance. And nobody has tested this on real conversations yet, because real conversations are confidential and risk events are rare.
Still, the regulatory implication is worth sitting with. If an independent auditor can hit these numbers on conversation-level risk, it becomes much harder for an AI mental health product to argue that external monitoring is technically infeasible. The post-market surveillance question shifts from "can we?" to "who pays for it, and how often?"
Metonym is building toward exactly this layer — the Salient Distress Model is a methodology for clinical-grade risk evaluation that lives outside the chatbot, because the translation-loss problem ASTRA is starting to measure is the same problem we think any serious safety infrastructure has to solve.
Metonym Clinical AI Intelligence — regulatory analysis at the intersection of clinical evaluation and AI safety. Produced under the Metonym Standard. Informational only — not legal advice, not clinical advice.



