How AI Clinical Notes Actually Work (and Where They Fail) for Therapists

If you've spent the last six months hearing other therapists rave about AI scribes and quietly wondering whether you're missing something, this guide is for you. By the end you will understand exactly what these tools do, what they cannot do, and how to decide whether one belongs in your practice.

The documentation problem, stated honestly

The average therapist in private practice spends 15 to 25 minutes per session on documentation. Multiply by 25 sessions a week and you're staring at a documentation burden that is, depending on your modality, roughly 6 to 10 hours each week: most of a working day at the low end, more than a full one at the high end. That work happens at night, on weekends, or in the 8-minute breaks between sessions where you would otherwise eat lunch.
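If you want to plug in your own numbers, the multiplication is easy to spell out. Here is a minimal sketch in Python using the figures quoted above; the function name is ours, purely for illustration.

```python
# Weekly documentation load, using the figures quoted in the text.
SESSIONS_PER_WEEK = 25
MINUTES_PER_NOTE_LOW = 15
MINUTES_PER_NOTE_HIGH = 25


def weekly_hours(minutes_per_note: float, sessions: int = SESSIONS_PER_WEEK) -> float:
    """Convert per-session note time into hours of documentation per week."""
    return minutes_per_note * sessions / 60


low = weekly_hours(MINUTES_PER_NOTE_LOW)
high = weekly_hours(MINUTES_PER_NOTE_HIGH)
print(f"{low:.2f} to {high:.2f} hours per week")  # 6.25 to 10.42 hours per week
```

Swap in your own caseload and per-note time to see where you actually sit in that range.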

That's the problem this category of tools exists to solve. Whether it actually solves it for your practice depends on details that get glossed over in vendor marketing.

The three-stage pipeline behind every AI scribe

Every tool in this space (Mentalyc, Upheal, Heidi Health, the in-EHR scribes from SimplePractice and TherapyNotes, the dozen newcomers) runs the same three-stage pipeline under the hood:

Stage 1: Audio capture

The tool records the session via your laptop microphone, a dedicated app, a Zoom integration, or (less commonly) a phone app. Quality matters more than people realize:

Use a USB mic, not your laptop's built-in
Aim it at the midpoint between you and your client
If you do telehealth, capture both audio streams separately when the tool supports it

The accuracy of everything downstream is bounded by the audio quality at this step.

Stage 2: Transcription

The audio file is uploaded to the vendor's servers and run through a speech-to-text model. As of 2026, the leaders use OpenAI Whisper-class or Deepgram-class models tuned on clinical language. Typical word error rates (the share of words the model gets wrong) for clear American English in a quiet room sit around 2–4%, better than human transcriptionists were five years ago.
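Word error rate is worth understanding before you compare vendor claims. It counts the substitutions, deletions, and insertions needed to turn the tool's transcript into a correct reference transcript, divided by the number of words in the reference. The sketch below is a plain-Python version of that standard calculation, not any vendor's evaluation code; real benchmarks also normalize punctuation and numbers, which this skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Edit-distance table over words rather than characters.
    # At the start of row i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: word missing from the hypothesis
                curr[j - 1] + 1,     # insertion: extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the client reported taking sertraline every morning for two weeks"
hypothesis = "the client reported taking certain line every morning for two weeks"
print(word_error_rate(reference, hypothesis))  # 0.2
```

Here a single mangled drug name ("sertraline" heard as "certain line") counts as two errors in a ten-word sentence. That is how a transcript with a low overall error rate can still get the clinically important words wrong.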

Common transcription failures:

Acronyms and drug names (especially psychiatric medication names)
Names the model has never heard (your client's name, your name, niche modality names)
Cross-talk (two people speaking at once)
Strong accents or non-English code-switching

Most tools let you build a custom "vocabulary" of names and terms specific to your practice. Use this feature. It is the single highest-leverage tweak.
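To see why a vocabulary list helps, here is a deliberately simple illustration: patching near-miss spellings in a finished transcript against a list of known terms. Treat it as an assumption-laden sketch. The term list and the `apply_vocabulary` helper are hypothetical, and commercial tools more likely feed your vocabulary into the speech model itself than patch text afterwards.

```python
import difflib

# Hypothetical practice-specific terms; real tools manage this list in their settings.
CUSTOM_VOCABULARY = ["sertraline", "lamotrigine", "EMDR", "Priya"]


def apply_vocabulary(transcript: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a known term with that term."""
    lookup = {term.lower(): term for term in vocabulary}
    corrected = []
    for word in transcript.split():
        core = word.strip(".,;:!?")  # compare without surrounding punctuation
        if core:
            match = difflib.get_close_matches(core.lower(), list(lookup), n=1, cutoff=cutoff)
            if match:
                word = word.replace(core, lookup[match[0]])
        corrected.append(word)
    return " ".join(corrected)


print(apply_vocabulary("Client restarted sertralene last week.", CUSTOM_VOCABULARY))
# Client restarted sertraline last week.
```

The high cutoff is the design choice that matters: set it too low and ordinary words start getting "corrected" into drug names, which is worse than the original error.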

Stage 3: Note generation

This is where the magic (and the failure modes) live. The transcript is passed to a large language model (a Claude-class or GPT-class model) with a prompt that says, roughly: "Here is a therapy session transcript. Produce a DAP (Data, Assessment, Plan) note in this format, using only clinically relevant content."

What "clinically relevant" means is the entire game. The best vendors have spent months (sometimes years) refining their prompts and templates with clinician feedback. The result is a note that:

Captures the client's stated concerns in their language
Summarizes interventions and clinical impressions
Flags risk indicators without inventing them
Suggests treatment plan adjustments worth your review

The worst vendors will give you a generic medical summary that misses the entire arc of a therapy session. You should always trial 2–3 tools on real sessions before committing, because the per-vendor quality gap is enormous.
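Mechanically, Stage 3 is string assembly followed by a model call. The sketch below shows only the assembly step, with instructions drawn from the qualities listed above. Everything named here (`NoteRequest`, `build_prompt`, the exact wording) is hypothetical; vendors generally do not publish their prompts, and the real ones are presumably far longer.

```python
from dataclasses import dataclass

# Section names for two common note formats. DAP is the format used in this guide.
NOTE_SECTIONS = {
    "DAP": ["Data", "Assessment", "Plan"],
    "SOAP": ["Subjective", "Objective", "Assessment", "Plan"],
}


@dataclass
class NoteRequest:
    transcript: str
    note_format: str = "DAP"


def build_prompt(request: NoteRequest) -> str:
    """Assemble the instruction text that would accompany the transcript."""
    sections = NOTE_SECTIONS[request.note_format]
    rules = [
        "Use only clinically relevant content from the transcript.",
        "Keep the client's stated concerns in their own words where possible.",
        "Summarize interventions and clinical impressions.",
        "Flag risk indicators that appear in the transcript; never invent them.",
        "Do not include anything that was not said in the session.",
    ]
    lines = [
        "Here is a therapy session transcript.",
        f"Produce a {request.note_format} note with these sections: {', '.join(sections)}.",
        "Rules:",
        *[f"- {rule}" for rule in rules],
        "Transcript:",
        request.transcript,
    ]
    return "\n".join(lines)


prompt = build_prompt(NoteRequest(transcript="Therapist: How was your week? Client: Better."))
print(prompt)
```

Seeing it laid out this way makes the vendor gap easier to understand: two tools can use the same underlying model and still produce very different notes, because the rules and templates wrapped around the transcript differ.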

What good output looks like (with an example)

Here's a representative high-quality DAP-format draft note for a 50-minute CBT session about work-related anxiety. The session details have been fictionalized so that nothing identifying remains.

D (Data): Client presented for fourth session, on time, affect congruent. Reported a 6/10 anxiety baseline this week, down from 8/10 last week. Described two work meetings where they noticed catastrophic thinking ("if this goes badly I'll be fired") but successfully used the thought record we developed in session three to interrupt the pattern. No SI/HI reported.

A (Assessment): Generalized anxiety symptoms appear responsive to CBT thought-record interventions. Client demonstrates increasing capacity to identify and challenge cognitive distortions in real time. Continued risk of regression under high work stress; protective factors include partner support and consistent sleep.

P (Plan): Continue weekly sessions. Introduce behavioral experiment for next week. Client to schedule one anxiety-provoking but low-stakes work conversation and apply thought record technique pre/post. Review at next session.

Notice what's good:

Uses the client's own framing ("if this goes badly I'll be fired")
Tracks change between sessions (8→6 numerical rating)
Documents continuity with the previous session's intervention
Includes the risk screen
Proposes a specific, time-bounded next step

This took the AI roughly 90 seconds to draft from a 50-minute transcript. Editing took the therapist about 4 minutes: fixing one term, adding a sentence about insurance documentation, and removing a sentence the AI invented about "social anxiety symptoms" that was not actually discussed.

Where AI scribes reliably fail

Six failure modes you will encounter, all of which are normal and none of which are reasons to avoid the category:

1. Inventing content (hallucination)

The model will, occasionally, write a sentence that sounds clinically reasonable but did not happen. This is the single most important reason to never sign a note without reading it line by line. Hallucination rates from the leaders are low (typically under 2% of notes contain a meaningful invention), but the rate is not zero and never will be.
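A crude automated check can point you at sentences worth a second look, though it does not replace reading the note. The sketch below flags note sentences whose key words mostly do not appear in the transcript. It is a toy lexical-overlap heuristic with a hypothetical stopword list and threshold, so legitimate paraphrase will trigger false alarms and a well-disguised invention can slip through.

```python
import re

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"client", "reported", "described", "with", "being",
             "that", "this", "from", "have", "were"}


def content_words(text: str) -> set[str]:
    """Lowercased words longer than three letters, minus stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if len(w) > 3 and w not in STOPWORDS}


def unsupported_sentences(note: str, transcript: str, min_overlap: float = 0.5) -> list[str]:
    """Return note sentences with too little word overlap with the transcript."""
    transcript_words = content_words(transcript)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & transcript_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


transcript = ("I had two meetings at work and kept thinking I would be fired, "
              "but the thought record helped me calm down.")
note = ("Client described two work meetings with thoughts of being fired. "
        "Client used the thought record. "
        "Client reported social anxiety symptoms at parties.")
print(unsupported_sentences(note, transcript))
# ['Client reported social anxiety symptoms at parties.']
```

Only the sentence with no grounding in the transcript is flagged. Anything a check like this surfaces still needs your judgment, and anything it misses is still your responsibility when you sign.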

2. Missing the emotional arc

A skilled therapist's notes capture the movement of a session: what shifted, what landed, what the client almost said but didn't. AI is much weaker here than at recording explicit content. If the meta-content of your sessions is the point (psychodynamic work, IFS, somatic modalities), AI notes will feel thin. Some therapists add a final "clinical impression" paragraph by hand after the AI draft.

3. Speaker confusion in couples and family work

Two-person sessions transcribe well. Three or more voices (especially when they sound similar) produce notes in which statements are attributed to the wrong person. This is improving but is not solved as of 2026.

4. Long silences that turn into filler

If your work involves somatic techniques or contemplative pauses, the model may interpret a 90-second silence as "session interrupted" or fill it with generic transition language. Some vendors let you mark silent-but-productive periods; most don't.
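If your tool exports a transcript with timestamps, you can locate the long pauses yourself and check how the note handled each one. The sketch below assumes a hypothetical export in which every spoken segment has a start and end time in seconds; the `Segment` shape and the 60-second threshold are illustrative, not taken from any product.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float


def find_long_silences(segments: list[Segment], min_gap_seconds: float = 60.0) -> list[tuple[float, float]]:
    """Return (silence_start, silence_end) pairs for gaps at least min_gap_seconds long."""
    ordered = sorted(segments, key=lambda s: s.start)
    gaps = []
    if not ordered:
        return gaps
    latest_end = ordered[0].end
    for segment in ordered[1:]:
        if segment.start - latest_end >= min_gap_seconds:
            gaps.append((latest_end, segment.start))
        # Track the latest end seen so overlapping speech is not counted as silence.
        latest_end = max(latest_end, segment.end)
    return gaps


segments = [
    Segment("therapist", 0.0, 12.0),
    Segment("client", 13.0, 40.0),
    Segment("therapist", 130.0, 140.0),
]
print(find_long_silences(segments))  # [(40.0, 130.0)]
```

Each pair it returns is a stretch of the draft to reread for "interruption" language or invented filler.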

5. Crisis-language minimization

If a client mentions suicidal ideation in passing and immediately moves on, lower-tier vendors will sometimes summarize the session without highlighting it. The leaders flag SI/HI language for explicit review. Test this on a redacted session before you trust any vendor with crisis cases.
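One way to run that test is to scan the transcript for risk language yourself, then confirm by hand that the vendor's draft mentions every passage you found. The sketch below does the scanning half. The phrase list is a short, hypothetical illustration and not a validated screening instrument, so it will miss indirect or idiomatic language.

```python
import re

# Illustrative patterns only; a passing mention can take many other forms.
RISK_PATTERNS = {
    "SI": [
        r"\bkill(ing)? myself\b",
        r"\bend(ing)? my life\b",
        r"\bsuicid\w*",
        r"\bbetter off without me\b",
    ],
    "HI": [
        r"\bkill(ing)? (him|her|them)\b",
        r"\bhurt(ing)? (him|her|them|someone)\b",
    ],
}


def flag_risk_language(transcript: str) -> list[tuple[str, str]]:
    """Return (label, sentence) pairs for sentences matching any risk pattern."""
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", transcript.strip()):
        for label, patterns in RISK_PATTERNS.items():
            if any(re.search(p, sentence, flags=re.IGNORECASE) for p in patterns):
                hits.append((label, sentence))
    return hits


transcript = ("Work was awful. Sometimes I think everyone would be better off without me. "
              "Anyway, the meeting went fine.")
print(flag_risk_language(transcript))
# [('SI', 'Sometimes I think everyone would be better off without me.')]
```

If the transcript produces a hit and the vendor's draft says nothing about it, you have found the minimization problem described above.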

6. Cultural-context blind spots

The model is trained mostly on English-language clinical content from majority-culture practices. Sessions involving culturally specific framings, code-switching, or non-Western therapeutic vocabulary will produce notes that flatten meaningful context. Plan to edit these by hand.

A reasonable evaluation process

If you're trying to pick a vendor, here's the process we'd recommend:

Pick 2–3 vendors that integrate with your EHR and offer a free trial (most do, for at least 7 days)
Run each on 5–10 real sessions with informed client consent
Score each draft on: accuracy, time-to-edit, template fit, and how often it surprised you (positively or negatively)
Pick the one that scored highest on time-to-edit, not the one with the most features. Time saved is the whole point.
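The scoring step above needs nothing fancier than a spreadsheet, but here is the same logic as a short script in case that is easier to reuse. The vendor labels, field names, and 1-to-5 rating scales are placeholders of ours, not part of any product.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrialNote:
    vendor: str          # placeholder label, not a real product
    edit_minutes: float  # time you spent fixing the draft
    accuracy: int        # your rating, 1 (poor) to 5 (excellent)
    template_fit: int    # your rating, 1 (poor) to 5 (excellent)
    surprises: int       # count of things that surprised you, good or bad


def summarize(notes: list[TrialNote]) -> dict[str, dict[str, float]]:
    """Average each score per vendor."""
    grouped: dict[str, list[TrialNote]] = {}
    for note in notes:
        grouped.setdefault(note.vendor, []).append(note)
    return {
        vendor: {
            "edit_minutes": mean(n.edit_minutes for n in items),
            "accuracy": mean(n.accuracy for n in items),
            "template_fit": mean(n.template_fit for n in items),
            "surprises": mean(n.surprises for n in items),
        }
        for vendor, items in grouped.items()
    }


def pick_fastest(notes: list[TrialNote]) -> str:
    """Choose the vendor with the lowest average time-to-edit."""
    summary = summarize(notes)
    return min(summary, key=lambda vendor: summary[vendor]["edit_minutes"])


notes = [
    TrialNote("Vendor A", 4.0, 5, 4, 1),
    TrialNote("Vendor A", 6.0, 4, 4, 0),
    TrialNote("Vendor B", 3.0, 4, 3, 2),
    TrialNote("Vendor B", 4.0, 4, 4, 1),
]
print(pick_fastest(notes))  # Vendor B
```

The other averages are there so you can sanity-check the winner: a tool that is fast to edit only because you stopped correcting it should show up as a drop in accuracy.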

Most therapists who follow this process pick within two weeks and don't switch again for a year or more.

The bottom line

AI clinical notes are not magic. They are a moderately impressive piece of language modeling welded to a perfectly normal speech-to-text pipeline, with thoughtful (or sometimes thoughtless) prompt design on top. Used well, they reclaim a meaningful slice of your week and produce notes you'd have written anyway. Used badly, they produce documentation you'll spend longer correcting than writing from scratch.

The category is worth taking seriously. Just trial two or three tools before you commit, read every note before you sign it, and treat the audio retention question (what happens to session recordings once they reach the vendor's servers) with the seriousness it deserves.

Browse the full comparison on the AI Clinical Notes & Scribes page.