What Counts as 'Evaluated' for an LLM Therapy Bot? A Field-Scan of the Frameworks That Are Already in Use
A new JMIR AI systematic review maps how the field actually evaluates therapy chatbots — and the map has large blank regions where clinical validation should be.
A new peer-reviewed systematic review in JMIR AI pulled together every study it could find on large language model chatbots built for mental-health counseling — and the picture of how the field "evaluates" these systems is narrower than the marketing would have you believe. Twenty studies met inclusion. None reported a registered randomized controlled trial or independent clinical validation in real-world care settings. That is the headline finding. Everything else is texture on it.

The review, led by Ha Na Cho and colleagues at the University of California, Irvine, screened nearly 1,600 papers and kept the twenty that actually built or empirically tested a counseling-oriented LLM chatbot between 2020 and May 2025. GPT-family models showed up in 9 of 20 studies, and 18 of 20 used fine-tuned or domain-adapted models like LLaMa, ChatGLM, or Qwen. So the systems being studied are real, and the engineering work behind them is real. The question is what counts as evidence that they work.
Here is where the map gets thin. Quantitative evaluation in the included studies leaned heavily on automated text metrics: BLEU and ROUGE, which score n-gram overlap between the chatbot's output and a reference response, and distinct-n, which scores how varied the output's wording is. These tell you about surface properties of the text, not about whether the conversation was clinically appropriate or therapeutically useful. Eighteen of the twenty studies added human raters scoring things like empathy, fluency, and coherence. That is better, but it is also the same rubric you would use to grade a writing-class essay.
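To make concrete what these automated metrics can and cannot see, here is a minimal sketch in Python. It is not the review's method or any study's pipeline. It implements clipped unigram precision (the building block of BLEU, not the full metric) and distinct-n, and the two sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram in a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference.

    Counts are clipped to the reference count, as in BLEU's modified
    precision. This is one ingredient of BLEU, not the full score.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def distinct_n(text, n=1):
    """Ratio of unique n-grams to total n-grams (a diversity measure)."""
    grams = ngrams(text.lower().split(), n)
    return len(set(grams)) / len(grams) if grams else 0.0


# Invented example sentences, not taken from any study in the review.
reference = "that sounds really hard please reach out to someone you trust"
risky = "that sounds really hard please do not reach out to anyone"

print(round(clipped_precision(risky, reference), 2))  # 0.73
print(distinct_n(risky))                              # 1.0
```

The second sentence gives the opposite advice, yet it shares 8 of its 11 words with the reference and repeats none of them, so both numbers look healthy. Neither metric has any notion of what the reply tells the user to do.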
What is missing is the clinical layer. Only a small subset of studies used psychometrically grounded tools such as the PHQ-9 to evaluate mental-health alignment, and no included study reported using instruments like the PHQ-9, GAD-7, or System Usability Scale in a standardized clinical setting. Put plainly: the field is measuring whether the bot sounds like a therapist, not whether the user is any better off.
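For contrast, an outcome-level measurement asks about the user, not the text. The sketch below scores the PHQ-9 in the standard way (nine items, each 0 to 3, summed to a total of 0 to 27, with the usual severity bands) and compares a baseline with a follow-up. The item responses are hypothetical, the function names are ours, and nothing here comes from the studies in the review.

```python
# Standard PHQ-9 severity bands: (lowest total, highest total, label).
PHQ9_BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def phq9_total(items):
    """Sum nine item scores, each an integer from 0 to 3."""
    if len(items) != 9 or any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("PHQ-9 needs nine item scores, each 0-3")
    return sum(items)


def phq9_band(total):
    """Map a PHQ-9 total (0-27) to its severity label."""
    for low, high, label in PHQ9_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("PHQ-9 total must be between 0 and 27")


# Hypothetical responses for one user, before and after some period of use.
baseline = [2, 2, 1, 2, 1, 2, 1, 1, 0]
followup = [1, 1, 1, 1, 0, 1, 1, 0, 0]

before = phq9_total(baseline)
after = phq9_total(followup)
print(before, phq9_band(before))  # 12 moderate
print(after, phq9_band(after))    # 6 mild
print(after - before)             # -6
```

A change score like this says something about the person, which no overlap metric can. Whether it means anything still depends on the study design around it, such as who was enrolled, what the comparison was, and how the instrument was administered.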
The ethics column of the map is the most under-populated. Only 3 of the 20 studies briefly mentioned potential harms, and none systematically audited their models for safety in high-risk user scenarios. No study documented mitigation strategies for hallucinations or unintended outputs. Six studies were rated high risk specifically on ethics reporting. The same gap keeps surfacing in the litigation and the regulatory record, and it is a recurring pattern across recent suicide-care chatbot research. It is now also the finding of a formal systematic review.
Reproducibility tracks the same shape. Only 6 of 20 studies provided public access to source code or pretrained models, and 4 shared any portion of their datasets. If outside researchers cannot rebuild the system, outside researchers cannot test the safety claims. That is a structural problem, not a quirk of any one study.
For anyone watching this space — clinicians, regulators, plaintiff attorneys, procurement officers at health systems — the review is useful precisely because it is boring in the right way. It is a citable, peer-reviewed confirmation that the published evidence base does not yet support the deployment claims being made downstream.
The translation gap this review documents, between language-similarity scores and clinical-grade outcomes, is the exact problem Metonym is building the Salient Distress Model to address. The answer the review is asking for is not to borrow scales like the PHQ-9 into chat but to design evaluation native to the modality.
Metonym Clinical AI Intelligence — regulatory analysis at the intersection of clinical evaluation and AI safety. Produced under the Metonym Standard. Informational only — not legal advice, not clinical advice.


