AI in Physician Clinical Performance Evaluation
Because there has to be a better way than what we're doing now
This week we’re going to look more closely at AI use in trainee and physician clinical evaluation. I consider this topic separate from physician quality evaluation, which is heavily dependent on charting and often unrelated to actual clinical performance.
To reality-check these ideas, I’ve asked Dr. Karim Hanna, an expert on AI in medical education, to comment on these emerging technologies. You’ll see his answers in italics, and you can sign up for his great AI+MedEd content here.
AI Clinical Assessment in Medical School
Innovative medical educators use AI before clinical training begins for curriculum design, tailored learning, and testing in the basic sciences. The University of Minnesota is using AI for medical student clinical skills assessment, which includes “applying AI to video recordings of physician-patient interactions and developing algorithms to analyze these encounters, measuring things like nonverbal communication, empathy and eye contact.” When the system is used to score standardized patient interactions like those in the USMLE Clinical Skills exam, the educators say it “doesn’t make errors” and “improves the reliability and validity of the assessments.” (Note: I’d be concerned about AI bias; I’m unsure what bias testing has been completed for this system.) AI can also be used to score clinical notes.
As an experienced medical educator, I've struggled with the standardized patient encounter, as these actors don't get a lot of prep before going in with a new case script. A 180-student class OSCE with 10 different actors is less than ideal. AI would bring more consistency to the process, particularly in grading. Trust me, I’ve been on the side of watching these seemingly endless replays of “the CHF case”: recently worsening DOE, sleeping on 3 pillows, with increased bilateral leg swelling. AI-assisted grading not only provides invaluable feedback to students but also offers a more objective assessment method.
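To make the grading piece concrete, here is a minimal, purely illustrative sketch in Python of what rubric-based scoring of a transcribed encounter could look like. The rubric items, keywords, and transcript are all invented, and a real system (presumably including Minnesota's) would rely on speech-to-text, video analysis, and far more sophisticated NLP than keyword matching:

```python
# Toy sketch: check whether a transcribed OSCE encounter touched the key history
# elements of "the CHF case." All rubric items and keywords are invented examples.

CHF_RUBRIC = {
    "dyspnea on exertion": ["short of breath", "dyspnea", "winded"],
    "orthopnea": ["pillows", "lying flat", "orthopnea"],
    "leg swelling": ["swelling", "edema", "ankles"],
    "medication review": ["medication", "diuretic", "lasix"],
}

def score_encounter(transcript: str, rubric: dict[str, list[str]]) -> dict:
    """Return which rubric items the encounter appears to have covered, plus a score."""
    text = transcript.lower()
    covered = {item: any(keyword in text for keyword in keywords)
               for item, keywords in rubric.items()}
    return {"covered": covered, "score": sum(covered.values()) / len(rubric)}

if __name__ == "__main__":
    transcript = ("Student: Have you been short of breath when you walk? "
                  "Patient: Yes, and I need three pillows to sleep at night. "
                  "Student: Any swelling in your ankles?")
    print(score_encounter(transcript, CHF_RUBRIC))  # medication review is missed
```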
AI Clinical Assessment in Residency
Some residency programs use AI to evaluate applicants. Several frameworks exist to evaluate procedural skills in trainees, including the Global Operative Assessment of Laparoscopic Skills and the Objective Structured Assessment of Technical Skill, but they require many hours of expert-surgeon review time and are susceptible to bias.
I agree that AI can be a valuable tool for evaluating applicants. I imagine the systems used in the above example at Minnesota could be utilized in interviews of applicants too – both for medical school and residency. The evaluation of technical skills also holds promise for more efficient and objective assessment of procedural skills, potentially benefiting both applicants and programs alike.
One question I return to repeatedly is: where do we draw the line between how "good" an applicant is currently vs. how teachable they are? Do I want to recruit an intern who is better than others with a laparoscope? Or do I want to recruit an intern with a growth mindset, who will perhaps fill my cohort with other important skills a resident should carry? An individual winning the NBA skills challenge on All-Star weekend does not guarantee your team wins the finals. Striking this balance is key - we all know what happened to Anakin Skywalker.
Evaluation of Practicing Physicians
Since half of physicians are below average, identifying and improving clinical skills would be a huge benefit to patients.
The Joint Commission requires evaluation of physician clinical skills via:
Focused Professional Practice Evaluations (FPPEs), triggered for:
Physicians new to the system
New privileges (i.e., if a physician requests privileges for caring for ICU patients but hasn’t done so in several years, a peer will review the chart)
Clinical skills that have been called into question
These are done mainly via chart review, though occasionally a peer proctor is required.
Ongoing Professional Practice Evaluations (OPPEs)
A hospital-dependent set of metrics that includes some clinical quality measures but often also non-clinical tasks like chart completion.
There have been some spectacular failures of this approach, including the highly publicized Dr. Death series. These peer reviews are not mandated to be performed by physicians, and 62% of hospitals say they don’t consider their peer review system to be standardized. The concept has been widely criticized for lack of reliability, bias, and lack of efficacy in improving care. A major issue is that doctors reviewing a chart can’t really determine the level of care provided; one study found that “physician agreement regarding quality of care is only slightly better than the level expected by chance”. My guess is that you’re not surprised by this. There’s so much that goes into clinical care that’s not captured in the chart: communication, collaboration, and the actual caring.
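To see why “only slightly better than chance” is so damning, note that reviewer agreement is usually reported as a chance-corrected statistic such as Cohen’s kappa. The toy example below (the ratings are entirely made up) shows how two reviewers can agree on 80% of charts yet show essentially no agreement beyond what their individual rating habits would produce by chance:

```python
# Toy illustration (invented ratings) of why raw agreement overstates concordance.
# Cohen's kappa subtracts the agreement expected if both reviewers rated at random
# according to their own base rates.

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(label) / n) * (rater_b.count(label) / n)
                   for label in labels)
    return (observed - expected) / (1 - expected)

# 1 = care judged adequate, 0 = care judged substandard, for 10 hypothetical charts
reviewer_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
reviewer_b = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

raw = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)
print(f"Raw agreement: {raw:.0%}")                                   # 80%
print(f"Cohen's kappa: {cohens_kappa(reviewer_a, reviewer_b):.2f}")  # -0.11, i.e. chance-level
```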
Now that we’ve established that we have a terrible system for evaluating physicians’ clinical skills, integrating AI into the system seems like an obvious next step.
AI Evaluation of Procedural Skills
AI in procedural skills teaching and assessment holds huge opportunity. For example, one study of neurosurgeons and trainees had 50 participants (14 attendings, 24 trainees, and 12 students) resect a cranial tumor in a virtual reality simulation; an AI algorithm was able to distinguish the participants’ level of training with 90% accuracy. Another study demonstrated 78% recall in identifying surgical tasks from video clips of procedures.
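As a rough sketch of how such a classifier could work, the example below uses synthetic data: a handful of per-procedure simulator metrics (the feature names, group means, and noise levels are all invented) are fed to an off-the-shelf classifier to predict level of training. The actual studies engineer far richer kinematic and force features from the simulator or video:

```python
# Minimal sketch, with synthetic data, of classifying training level from
# simulator-derived metrics. Feature names and group means are hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def simulate_group(n, means):
    """Synthetic rows of [instrument path length, tissue force, time, blood loss]."""
    return rng.normal(loc=means, scale=[5.0, 0.3, 4.0, 2.0], size=(n, 4))

# Hypothetical pattern: attendings move less, apply less force, and finish faster.
X = np.vstack([
    simulate_group(14, [40, 1.0, 20, 5]),   # attendings
    simulate_group(24, [55, 1.5, 28, 9]),   # residents
    simulate_group(12, [70, 2.2, 38, 14]),  # medical students
])
y = np.array([0] * 14 + [1] * 24 + [2] * 12)  # training-level labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(f"Cross-validated accuracy: {cross_val_score(clf, X, y, cv=5).mean():.2f}")
```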
Imagine a surgeon who could participate in several VR trainings, or even better, send video of actual surgeries performed. These could be analyzed by AI to ensure procedural skills were at a minimum standard. Of course there are myriad possible ways this could go wrong, but I’d argue that our current system of testing is a disservice to patients since it doesn’t address what patients actually want to know: is this person a good surgeon?
AI Evaluation of Communication and Reasoning
My expectation is that physician documentation will improve markedly with ambient scribes, and AI/NLP-assisted peer review would then have more robust information on which to evaluate physicians. Ambient scribes may also be able to track clinical reasoning by capturing which questions a physician asks, and to assess communication skills from the patient conversation itself.
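As a toy example of the kind of signal an ambient scribe could surface, the sketch below (hypothetical, with an invented speaker-labeled transcript) simply pulls out the questions the clinician asked; a reviewer, or a downstream model, could then compare that line of questioning against what the presentation warranted:

```python
# Hypothetical sketch: extract the clinician's questions from a speaker-labeled
# ambient-scribe transcript as raw material for reviewing clinical reasoning.

import re

def clinician_questions(transcript: str) -> list[str]:
    """Return the interrogative sentences spoken by the clinician."""
    questions = []
    for line in transcript.splitlines():
        speaker, _, utterance = line.partition(":")
        if speaker.strip().lower() != "clinician":
            continue
        sentences = re.split(r"(?<=[.?!])\s+", utterance.strip())
        questions += [s for s in sentences if s.endswith("?")]
    return questions

transcript = """Clinician: What brings you in today? Any chest pain with the shortness of breath?
Patient: No chest pain, just winded climbing the stairs.
Clinician: How many pillows do you sleep on? Have your medications changed recently?"""

for question in clinician_questions(transcript):
    print("-", question)
```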
Clinical Coaches Instead of Board Recertification Testing
Several medical boards have revamped how they approach recertification exams, recognizing that a test every 10 years about details you haven’t used in actual practice isn’t the most helpful approach. My own board, the American Board of Anesthesiology, recently switched to quarterly questions on an app plus documentation of quality improvement projects. But there’s a growing movement against what many physicians see as an ineffective and costly process.
Atul Gawande wrote eloquently in the New Yorker about how having a surgical coach helped him, making the case that if physicians want to practice at their best, they need coaches just like singers, athletes, and musicians do. Of course, identifying an expert in your exact field and having him or her come watch you operate is not practical for most physicians. However, AI will likely be able to suggest hand movements to surgeons or give feedback on difficult conversations to palliative care physicians. If instituted correctly, I think many physicians would be open to a system that makes meaningful suggestions about clinical care but doesn’t judge them.
Downsides to AI Clinical Performance Assessment
Lack of explainability was identified in one study as a possible barrier to use; after all, you have to be able to explain to a trainee or physician why they didn’t pass the assessment. This lack of explainability also has a legal aspect, as the EU’s recent AI regulations categorize systems “intended to be used for making decisions on promotion and termination of work-related contractual relationships, for task allocation and for monitoring and evaluating performance and behaviour of persons in such relationships” as ‘high-risk’. Although similar regulations don’t yet exist in the US, it isn’t difficult to imagine legal challenges from physicians whose clinical performance was monitored this way. Physician input will be crucial to ensure that any clinical evaluation system focuses on improvement and isn’t punitive.
Special thanks to Dr. Hanna for his insightful contributions this week.
This month we’ve covered:
Overview of AI in Performance and Productivity Evaluation
Productivity Evaluation in Software Developers
Next week, we’ll look at possible best- and worst-case scenarios for AI in physician productivity and performance evaluation.
If you’re a physician, join us at the ML for MDs Slack group, where we share resources and knowledge about the intersection of AI and healthcare.