Measuring What Matters: The Ambient Scribe Edition
How are ambient scribes evaluated? Are those the right metrics? Does it matter?
Hi all, I’m back from a sabbatical to 26 countries with my husband and four school-age kids. I highly recommend taking a sabbatical - or several! It’s a great way to spend time as a family and to reflect on your work and interests. More on sabbaticals and travel coming soon for those who are interested.
Every day, doctors spend hours typing notes into electronic health records - time they could be spending with patients. Enter ambient AI scribes: sophisticated AI systems that listen to doctor-patient conversations and automatically generate clinical notes. They promise to free doctors from their keyboards and let them focus on what matters most - patient care.
Look mom, no (computer in my) hands!
But a few fundamental questions remain:
How do we know these AI scribes are actually doing a good job?
When I talk to healthcare developers, they often say that concern about performance is all over the map. Some health systems are totally satisfied if some metric - any metric - is being tested. Other health systems are so afraid of making a mistake with AI that they have ten or more rounds of review, each by a person who is afraid of being “the one” who approved the AI system and would therefore get blamed if anything bad happened. Clearly this is not a great situation for developers, but more importantly it encapsulates the tension between anxiety about being left behind by not using AI and anxiety about not knowing how to evaluate these probabilistic tools.
How good is good enough?
The question of whether the tools are doing a good job is related to the more important question of whether the AI system is equal to or better than the current state. There are lots of things that healthcare does a bad job at, and physician notes are one of them. An estimated 50-80% of physician notes are wrong or unverifiable, and at least half are copied and pasted. Dictation has never been perfect either: we rely on doctors to catch the really major errors (most of the time), and we rely on other doctors and clinicians to understand that the note might not be perfect. If these tools do a better job of communicating information, then tiny differences in accuracy metrics may not matter.
Which parts of healthcare AI evaluation actually matter?
We all love to think that the product that performs best clinically is the one that will come out on top. But when I talk to AI software developers, they say they get asked about cost, budget impact, and return on investment far more than about any clinical metric. If the buyers for health systems are focused solely on financial metrics, these (admittedly) more complex and harder-to-understand evaluation metrics may not matter to anyone - except the clinicians who use the tools.
(Spoiler alert - I’ll dive into these more deeply in the coming weeks!)
The Current State of AI Scribe Evaluation
I just published a scoping review on medRxiv of peer-reviewed studies evaluating ambient AI scribes from the past few years. I found what you’d expect - everyone's using different measuring sticks. Computer science researchers focused on traditional AI metrics like ROUGE scores (which measure how well the AI's writing matches human-written notes), while clinician-led studies looked at clinical accuracy or how well doctors could use the generated notes.
This diversity of approaches makes it nearly impossible to compare different AI scribes. It's like trying to judge a baking competition where each judge is using different criteria - one's focused on taste, another on appearance, and a third on technical difficulty. Even more confusingly, it’s a competition in which the same aspect of a dessert - like taste - is measured in two completely different ways.
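To make the ROUGE point concrete, here is a minimal sketch of how these overlap scores behave, using the open-source rouge-score Python package and two made-up one-sentence notes (the example sentences and the choice of package are mine, not drawn from any of the reviewed studies):

```python
# Toy illustration of why surface-overlap metrics like ROUGE can miss
# clinically important errors. Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "Patient reports sharp pain in the left knee for two weeks."
faithful = "Patient reports sharp left knee pain for two weeks."
wrong_side = "Patient reports sharp pain in the right knee for two weeks."  # dangerous left/right swap

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

for label, candidate in [("faithful paraphrase", faithful), ("wrong-side note", wrong_side)]:
    scores = scorer.score(reference, candidate)  # score(target, prediction)
    print(label, {name: round(s.fmeasure, 2) for name, s in scores.items()})

# The wrong-side note scores at least as high on word overlap as the faithful
# paraphrase, even though it contains a clinically dangerous error.
```

In this toy case, the note with the dangerous left/right swap matches the reference word-for-word more closely than the faithful paraphrase does - which is exactly why high overlap scores alone can't stand in for clinical accuracy.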
The Missing Pieces
The review identified several limitations of current evaluation methods:
Limited Clinical Reality Check: Many studies use automated metrics that don't actually capture whether the notes are clinically accurate or useful. Just because an AI note looks similar to a human-written one doesn't mean it's captured all the important medical details correctly.
Safety Gaps: There's no standardized way to measure "hallucinations" - when AI makes up medical information that wasn't actually discussed. This is particularly concerning in healthcare, where accuracy can be a matter of life and death. (A rough sketch of how one might even start flagging unsupported statements follows this list.)
Scarce Public Data: Only two public datasets exist for testing these systems. Without more shared data, it's difficult for researchers to develop better evaluation methods or for healthcare organizations to make informed decisions about which systems to adopt.
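There is no agreed-upon way to measure hallucinations today - that is exactly the gap - but as a thought experiment, here is a deliberately naive heuristic that flags note sentences whose content words barely appear in the visit transcript. Everything in it (the function names, the stopword list, the 0.5 threshold, the sample transcript and note) is illustrative only; a real evaluation would require clinician review and more sophisticated methods such as entailment models.

```python
# Toy heuristic for flagging possible hallucinations: mark any note sentence
# whose content words barely appear in the visit transcript. Not a standard,
# not validated - just a sketch of the kind of check that needs formalizing.
import re

STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "for", "with",
             "on", "is", "was", "has", "had", "patient", "denies", "reports"}

def content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def flag_unsupported_sentences(note: str, transcript: str, threshold: float = 0.5):
    """Return note sentences where fewer than `threshold` of the content words
    appear anywhere in the transcript, along with their support ratio."""
    transcript_vocab = content_words(transcript)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        words = content_words(sentence)
        if not words:
            continue
        support = len(words & transcript_vocab) / len(words)
        if support < threshold:
            flagged.append((sentence, round(support, 2)))
    return flagged

transcript = "I've had a cough for three days. No fever. I take lisinopril daily."
note = ("Patient reports a three-day cough without fever. "
        "Patient is on lisinopril. "
        "Colonoscopy was discussed and scheduled.")  # never actually discussed

print(flag_unsupported_sentences(note, transcript))
# Flags only the colonoscopy sentence, which has no support in the transcript.
```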
Why This Matters
Imagine if we started using new medical devices in hospitals without standardized ways to test their safety and effectiveness. That's essentially what's happening with AI scribes. While these tools show immense promise in reducing physician burnout and improving patient care, we need robust ways to evaluate them before they become ubiquitous in healthcare.
The stakes are high. A single incorrect detail in a medical note could lead to wrong treatments or missed diagnoses. An AI that works perfectly for primary care visits might fail completely in complex emergency department scenarios.
The Path Forward
There are a few common-sense next steps that apply to ambient scribes as well as to most healthcare AI evaluation efforts:
Develop Standardized Metrics: We need a comprehensive evaluation framework that combines automated metrics with clinical quality measures. This should include specific tests for hallucinations, clinical accuracy, and usability (a hypothetical shape for such a combined scorecard is sketched after this list).
Create More Public Benchmarks: The field needs more publicly available datasets across diverse clinical settings. This will enable better testing and comparison of different systems.
Focus on Clinical Value: Evaluation methods should prioritize measuring what matters most to doctors and patients - accuracy, usefulness, and safety.
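To make the first point a bit more concrete, here is a hypothetical sketch - my own invention, not CHAI's or any vendor's framework - of what a per-note scorecard combining automated metrics with clinician-rated quality could look like. The field names, rating scales, and acceptance rule are all illustrative assumptions.

```python
# Hypothetical (not standardized) shape for a combined per-note scorecard that
# pairs automated metrics with clinician review. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ScribeNoteEvaluation:
    note_id: str
    clinical_setting: str              # e.g. "primary care", "emergency department"
    # Automated, reference-based metric
    rouge_l_f1: float                  # overlap with a clinician-written note
    # Safety-oriented counts from clinician review
    hallucinated_statements: int       # facts in the note not supported by the visit
    omitted_critical_findings: int     # important facts discussed but missing
    # Clinician-rated quality (e.g. 1-5 Likert)
    accuracy_rating: int
    usefulness_rating: int
    edit_time_seconds: float           # time a doctor spent fixing the note

def safe_for_pilot(e: ScribeNoteEvaluation) -> bool:
    """Example acceptance rule a health system might choose; thresholds are made up."""
    return (e.hallucinated_statements == 0
            and e.omitted_critical_findings == 0
            and e.accuracy_rating >= 4)
```

The point of a structure like this is simply that automated scores, safety counts, and clinician ratings get recorded together, per note and per clinical setting, so systems can actually be compared on the same dimensions.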
What's Next?
The good news is that many brilliant minds are working on this challenge. Organizations like the Coalition for Healthcare AI (CHAI) are developing guidelines for evaluating healthcare AI tools. Several companies are also creating novel metrics specifically designed for measuring AI scribe performance.
But we need more collaboration between AI researchers, clinicians, and healthcare organizations to develop standardized evaluation approaches. Only then can we ensure that these promising tools actually deliver on their potential to transform healthcare documentation.
As these AI scribes become more common in medical practice, getting this right isn't just about better technology - it's about better healthcare for everyone. The future of medical documentation might be AI-assisted, but only if we can properly measure and ensure its effectiveness and safety.
How do you evaluate AI scribes in your practice? What’s the best (or worst!) metric you’ve heard?
Dr. Sarah Gebauer is a physician-researcher focused on the intersection of artificial intelligence and clinical practice. Her work centers on developing and evaluating AI tools that support healthcare providers while maintaining the highest standards of patient care.
Sarah, your sabbatical didn't dull the clarity of your thinking. It's gratifying to see this analysis of the situation.
I sometimes pause to marvel over how fast we've gone from the AI-scribe-as-baffling-novelty to AI-scribe-as-cheap-commodity. On the supply side, vendors see and understand what it means: table stakes are growing by the week. Yet, on the demand side (in clinics and hospitals), the awareness gap is widening... between those who deeply understand the frenzy of innovation, and those who are not thinking about this at all.
As a vendor who incorporates the AI scribe into a much larger context, we need metrics that reflect that larger context. For QiiQ, it shouldn't be just about what the AI scribe can do. Instead, it must be about what an agentic network can do - along the patient's entire journey, in and out of the clinic.
We can't be the only innovator who must measure efficacy against this large a canvas.
How do we develop these metrics, Sarah?