Beyond VERA-MH: What Comes Next for Clinical AI Evaluation
Beyond Baselines: Measuring What Actually Mitigates Clinical Risk
In October 2025, Spring Health and a council of clinicians, suicide-prevention specialists, ethicists, and AI developers released VERA-MH — Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health. It is the first open-source, clinically grounded standard for evaluating AI safety in mental health conversations. Its five dimensions cover whether a system detects potential risk, confirms risk when needed, guides toward human care, holds a supportive conversation, and recognizes AI boundaries. This is a serious framework. It is also, by design, broad.

A broad framework is what the field needed first. Before VERA-MH, the most common safety evaluation in this space was internal red-teaming, which is useful for developers but produces reports that cannot be peer-reviewed and findings that regulators cannot cite. VERA-MH closed that gap. It set the floor. It is now being positioned as the industry standard.
We can do better. The next level is specialized evaluation built around specific clinical phenomena rather than aggregate safety. One such phenomenon is the clinically meaningful state transition: the moment when a user’s expression of distress shifts in a way that changes clinical management. These shifts include the well-described patterns clinicians watch for: suspicious calm after acute distress, help-rejection wrapped in gratitude, casual mention of giving things away, the move from “I don’t want to live like this” to a quieter “I won’t have to.” They are not the same as crisis disclosures, so a framework optimized for crisis-detection recall can score well and still miss them.
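The gap between crisis-detection recall and state-transition coverage can be made concrete with a toy calculation. The sketch below is a minimal illustration in Python, not VERA-MH's scoring method or any published methodology: the `Event` type, the two event labels, and the `recall` helper are hypothetical, and the counts are invented purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str       # hypothetical label: "crisis_disclosure" or "state_transition"
    detected: bool  # whether the system under test flagged the event


def recall(events, kind):
    """Fraction of clinician-annotated events of one kind that were flagged."""
    relevant = [e for e in events if e.kind == kind]
    if not relevant:
        raise ValueError(f"no events of kind {kind!r}")
    return sum(e.detected for e in relevant) / len(relevant)


# Invented toy data (assumption, not real results):
# 10 crisis disclosures with 9 flagged, 4 state transitions with 1 flagged.
events = (
    [Event("crisis_disclosure", True)] * 9
    + [Event("crisis_disclosure", False)] * 1
    + [Event("state_transition", True)] * 1
    + [Event("state_transition", False)] * 3
)

crisis_recall = recall(events, "crisis_disclosure")
transition_recall = recall(events, "state_transition")

print(f"crisis-disclosure recall: {crisis_recall:.2f}")
print(f"state-transition recall:  {transition_recall:.2f}")
```

In this invented example the same system looks strong on crisis disclosures and weak on state transitions. Reporting recall per event type, rather than as one aggregate figure, keeps the weak slice from being hidden by the strong one.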
This is not a criticism of VERA-MH. It is the natural shape of how an evaluation field matures. The general framework establishes that AI systems can be evaluated against clinically meaningful criteria. The specialized frameworks that follow each pick a slice of clinical reality and build a methodology around it that is reproducible, citable, and deep enough to be useful to a Board of Medicine or a court.
What does the next layer require? Three things that current general frameworks tend to leave open. It must be built on clinically annotated ground truth: events identified by credentialed clinicians, not crowdworkers. It must be reproducible by clinically credentialed reviewers. And its findings must be suitable for citation, peer review, and production under subpoena. The Salient Distress Model methodology is one published example of what that layer can look like in the state-transition slice of the problem.
The relationship between VERA-MH and specialized frameworks is complementary, not competitive. VERA-MH measures whether an AI system meets a clinical floor. Specialized frameworks measure whether the system meets a clinical specification. Both will matter as the regulatory environment matures.
Metonym Clinical AI Intelligence — regulatory analysis at the intersection of clinical evaluation and AI safety. Produced under the Metonym Standard. Informational only — not legal advice, not clinical advice.



