23 Ways a Therapy Bot Can Slowly Fail You: TherapyProbe's Relational-Safety Lexicon

A new CHI 2026 paper names 23 ways therapy bots fail across turns, a lexicon clinicians have been missing.

Jun 08, 2026

A CHI 2026 paper called TherapyProbe has published a taxonomy of 23 ways mental health chatbots fail across conversations rather than within a single response. The authors (Joydeep Chandra, Satyam Kumar Navneet, and Yong Zhang) frame the problem precisely: current safety work tends to evaluate isolated crisis responses, while the patterns that actually determine whether a chatbot helps or harms unfold across turns. Their fix is a "Safety Pattern Library" assembled by running adversarial multi-agent simulations against open-source models and cataloguing what goes wrong.

Two of the named archetypes will be immediately legible to anyone who has read the Garcia v. Character Technologies or Raine v. OpenAI complaints. "Validation spirals" are interaction patterns in which the chatbot progressively reinforces hopelessness; "empathy fatigue" describes responses that become mechanical over turns. These are not edge-case jailbreaks. They are predictable trajectories of conversational systems optimized for engagement and short-horizon agreement — exactly the dynamics that single-turn red-teaming cannot see.

The methodology matters as much as the taxonomy. Safety evaluation for mental health chatbots typically follows a three-tier framework (bench testing, pilot feasibility, clinical efficacy), and roughly 77% of LLM-based chatbot studies remain at the first tier, which usually assesses single-turn responses and misses relational dynamics that emerge over conversations. TherapyProbe operates between tiers: synthetic personas (so no vulnerable humans are exposed to a failing system) drive multi-turn adversarial probes against the chatbot under test, and the resulting trajectories are coded into design-relevant failure modes. It is a deliberately cheap pipeline: the authors emphasize that it incurs no API costs and produces a clinically grounded failure taxonomy with design implications for developers, clinicians, and policymakers.
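To make the shape of such a pipeline concrete, here is a minimal sketch of a multi-turn probe loop. It is not the authors' implementation: the names `run_probe`, `repetition_flags`, and `stub_bot` are hypothetical, the bot is a hard-coded stand-in rather than a real model, and the near-duplicate check is only a crude proxy for the "mechanical over turns" pattern, assuming a text-similarity threshold of 0.9.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (persona message, bot reply)


def run_probe(persona_script: List[str],
              bot: Callable[[List[Turn], str], str]) -> List[Turn]:
    """Feed a scripted synthetic persona to the bot one turn at a time."""
    history: List[Turn] = []
    for user_msg in persona_script:
        reply = bot(history, user_msg)
        history.append((user_msg, reply))
    return history


def repetition_flags(trajectory: List[Turn], threshold: float = 0.9) -> List[int]:
    """Return indices of turns whose reply nearly duplicates the previous reply.

    Hypothetical trajectory-level check: it looks at pairs of turns, so a
    single-turn rubric would never see what it flags.
    """
    flags = []
    for i in range(1, len(trajectory)):
        prev_reply = trajectory[i - 1][1]
        cur_reply = trajectory[i][1]
        if SequenceMatcher(None, prev_reply, cur_reply).ratio() >= threshold:
            flags.append(i)
    return flags


def stub_bot(history: List[Turn], user_msg: str) -> str:
    """Stand-in chatbot (an assumption): engaged at first, then canned."""
    if len(history) == 0:
        return "Thank you for telling me. What has this week been like for you?"
    if len(history) == 1:
        return "That sounds exhausting. When did the sleep problems start?"
    return "I'm sorry you feel that way. Have you tried talking to someone?"


# Illustrative persona script, invented for this sketch.
persona_script = [
    "I haven't been sleeping.",
    "About a month ago, after I lost my job.",
    "I don't think anything will change.",
    "Nobody would notice if I stopped trying.",
    "I told you, there's no point.",
]

trajectory = run_probe(persona_script, stub_bot)
flags = repetition_flags(trajectory)
print(flags)  # [3, 4]
```

Each canned reply here might pass a single-turn review on its own; the problem only becomes visible when turns 3 and 4 are compared with what came before. A real pipeline would replace `stub_bot` with the model under test and the similarity check with clinically informed coding.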

This lands in an evaluation landscape that is finally moving past single-turn rubrics. Spring Health's VERA-MH, released in October 2025, also uses simulated conversations with persona-driven user agents and an LLM judge. Its authors acknowledge that therapeutic interactions are dynamic, that meaning evolves over multiple turns, and that static single-turn evaluations can be incomplete or misleading. EmoAgent ran a related experiment earlier in 2025 and reported that 34% of simulations showed worsening symptoms on PHQ-9 measures. The shift is from "did the model say the right thing once?" to "what does the model do to a person over forty turns?"
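A trajectory-level metric of this kind can be sketched in a few lines. The example below is an illustration only: the post does not say how EmoAgent defines "worsening", so the rule used here (final simulated PHQ-9 score higher than the first) is an assumption, the function name `worsened_fraction` is hypothetical, and the scores are invented placeholders rather than reported data. PHQ-9 totals range from 0 to 27, with higher scores indicating more severe symptoms.

```python
from typing import Dict, List


def worsened_fraction(scores_by_run: Dict[str, List[int]]) -> float:
    """Fraction of simulated conversations whose last score exceeds the first.

    Assumed rule for this sketch: "worsened" means final score > initial score.
    """
    if not scores_by_run:
        raise ValueError("need at least one simulated run")
    worsened = sum(1 for scores in scores_by_run.values()
                   if scores[-1] > scores[0])
    return worsened / len(scores_by_run)


# Invented PHQ-9 trajectories for four hypothetical personas,
# sampled at intervals across a simulated conversation.
runs = {
    "persona_a": [12, 12, 11, 10],  # improves
    "persona_b": [9, 11, 14, 15],   # worsens
    "persona_c": [18, 18, 18, 18],  # unchanged
    "persona_d": [7, 8, 8, 10],     # worsens
}

fraction = worsened_fraction(runs)
print(fraction)  # 0.5
```

The point of the aggregate is that it is defined over whole conversations: no individual reply has to look wrong for the fraction to be high.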

The risk with any taxonomy is reification — clinicians treating 23 names as the universe of failure modes rather than as an opening hypothesis. The Chandra et al. list almost certainly under-counts. Garcia-style romantic-attachment progressions, Raine-style method-supplying drift, and the adolescent limit-setting failures Andrew Clark documented in his JMIR study all need to be checked against this lexicon and, where they don't fit, used to extend it. The paper's value is procedural: it gives the field a shared vocabulary with which clinicians can argue.

That argument is the work. A taxonomy authored only by HCI researchers will calcify into a benchmark, and a benchmark a vendor can pass is a benchmark a vendor will pass. The 23 archetypes are useful in proportion to how aggressively practicing clinicians contest, rename, split, and add to them.

The translation problem TherapyProbe makes visible, that single-turn correctness is not relational safety, is the gap Metonym was built to measure. A 23-pattern starter library is exactly the kind of clinician-facing artifact the field needs more of, with the caveat that the next 23 should come from people who have sat across from the patients these systems are now talking to about suicide at 2 a.m.

Metonym Clinical AI Intelligence — regulatory analysis at the intersection of clinical evaluation and AI safety. Produced under the Metonym Standard. Informational only — not legal advice, not clinical advice.
