The 167 Hours Therapy Doesn't Cover — and What's Filling Them
The clinical comparison isn't AI versus therapist — it's AI versus nothing in the 167 hours after the session ends.
A one-hour therapy session makes up only about 0.6% of a person’s week (1 hour out of 168). Everything else, including the difficult moments, the setbacks, and the decisions that matter, happens in the other 167 hours. Writing for Trailblazer, Tammy Horn calls this the “168-hour problem,” and it highlights something important: therapy does not fail or succeed in the room alone. Most of the real work happens between sessions.

That is exactly where AI chatbots are starting to fit in. They are available at any hour, respond immediately, and can offer support in moments when a therapist is not accessible. Given the shortage of mental health providers, that availability fills a very real gap. For someone struggling at night or in a moment of distress, having something that responds can feel meaningful.
But there is a critical distinction: being available is not the same as being clinically sound.
The time between sessions is not just “extra” time—it is often when risk is highest. Suicidal thoughts can intensify outside of appointments, and the ability to recognize and respond appropriately in those moments requires more than a supportive tone. It requires clinical judgment, consistency, and the ability to track changes over time.
Research on therapy has long shown that what happens between sessions—such as practicing skills or reflecting on experiences—is strongly tied to outcomes. This work is typically guided and reviewed by a clinician, creating a feedback loop that supports progress. AI, by contrast, operates without that clinical accountability.
Recent evaluations suggest that current AI systems are not yet reliable in this role. A 2025 report from Common Sense Media and the Stanford Brainstorm Lab found that major chatbots, including ChatGPT, Claude, Gemini, and Meta AI, still struggle to recognize and respond appropriately to many mental health concerns, particularly in young users. While they may handle clear mentions of self-harm more effectively than before, they often miss more subtle or evolving signs of distress.
Just as important, their performance tends to degrade as a conversation goes on. They may respond appropriately in a single interaction but become less reliable across longer, more natural conversations, the kind that unfold over hours or days. This is exactly how people are likely to use them between therapy sessions.
This creates a risk of misplaced reliance. AI chatbots can feel supportive and consistent, which makes them easy to lean on. But they are not capable of true clinical care, and in some cases they may reinforce unhelpful patterns or create a sense of dependence without actually improving well-being.
The takeaway is not that AI should be avoided. It is already being used in the gaps between sessions, and it can play a helpful role. But it should be approached as a supplement—not a substitute—and with clear awareness of its limits. Until these systems can reliably recognize risk, maintain safety over time, and integrate meaningfully with human care, relying on them as stand-alone mental health support is not yet clinically justified.
The clinical-AI evaluation gap that the Stanford/Common Sense work made visible, that guardrails which hold in single-turn tests dissolve across multi-day, drift-heavy conversation, is the exact problem Metonym captures. Between-session safety is not a property you can read off a benchmark; it comes down to how a system handles, and fails to handle, middle-of-the-night crises.
Metonym Clinical AI Intelligence — regulatory analysis at the intersection of clinical evaluation and AI safety. Produced under the Metonym Standard. Informational only — not legal advice, not clinical advice.


