The dissolution of OpenAI’s safety team last week was big news, as one of the company’s founding principles was to ensure safe AI use. Although the safety team is apparently being subsumed into a larger alignment team (and they have some really smart people working on safety still!), the safety leaders who resigned felt that safety “was taking a backseat to shiny products”.
Other companies with a core mission of safety (a category that theoretically includes all of healthcare AI) may now feel less pressure to maintain that safety focus.
But healthcare AI companies have a greater responsibility to research and report the safety of their products than general purpose foundation models do, so we don’t have to worry about that…right?
Safety in healthcare AI
How many healthcare AI companies have safety teams, policies, public safety benchmarks, or research teams devoted to ongoing safety monitoring?
The short answer: some have portions of these, some have none.
The long answer: find out in the next Substack!
Admittedly, safety looks very different in foundation models than it does in healthcare AI; for foundation models, the idea is to identify where AI systems might introduce risks as they grow in power and capability. These risks include deception, toxic content, and cybersecurity vulnerabilities. Although they signed a voluntary commitment, the foundation model companies currently have no governmental body verifying that their safety-related claims are true.
Patient safety in AI is not just one thing
In healthcare, the focus is more on efficacy (does the product work?), with somewhat less attention to ensuring that AI systems don't introduce new patient harms into the existing infrastructure. Some healthcare AI companies have HHS, the ONC, and/or the FDA to answer to for aspects of safety and efficacy.
However, there are many aspects of AI safety rolled into the concept of “patient safety”.
Clinicians intuitively understand that you can’t look at just one metric to determine patient safety. In case you didn’t grasp that, I’m willing to bet that you’ve been shown a graphic of Swiss cheese many, many times to hammer that point home. We know that a patient can be harmed by not getting the correct antibiotic at the right time, falling due to faulty bed alarms, or missing a crucial diagnosis. Testing whether a health system is ‘safe’ involves many metrics from diverse parts of the hospital, measured in different ways. Similarly, measuring if an AI system is safe within a health system involves evaluating that system in multiple ways.
Ensuring that an AI system is safe for patients likely requires evaluating multiple aspects of that system over time.
(For background, see "Understanding the 'Swiss Cheese Model' and Its Application to Patient Safety" on PMC.)
The many aspects of healthcare AI safety
Understanding the aspects of safety that each healthcare AI product affects is crucial to determining how safety should be evaluated. Healthcare AI is now basically just “healthcare”, since so many companies incorporate AI (or at least claim to). The graphic below from The Medical Futurist shows the range of digital health and AI companies; the scope is enormous.
As you can imagine, each field, and even each product, will have a unique set of safety concerns. Diagnostic software needs to be evaluated for diagnostic bias, while ambient scribes need to be assessed for transcription accuracy; these are very different machine learning and software problems.
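To make that contrast concrete, here is a minimal sketch of what those two evaluations might look like. Everything in it (the data, the function names, the example inputs) is hypothetical; real evaluations would use validated datasets and far more metrics.

```python
# Minimal sketch of two very different safety checks; all data and names are
# hypothetical, and a real evaluation would use validated clinical datasets.

def false_negative_rate(labels, preds):
    """FNR = missed positives / actual positives (a costly error in diagnosis)."""
    positives = [(l, p) for l, p in zip(labels, preds) if l == 1]
    if not positives:
        return 0.0
    return sum(1 for l, p in positives if p == 0) / len(positives)

def subgroup_fnr_gap(labels, preds, groups):
    """Diagnostic-bias check: the largest FNR gap between demographic subgroups."""
    rates = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        rates[g] = false_negative_rate([labels[i] for i in idx],
                                       [preds[i] for i in idx])
    return max(rates.values()) - min(rates.values()), rates

def word_error_rate(reference, hypothesis):
    """Scribe-accuracy check: classic WER via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical usage:
gap, rates = subgroup_fnr_gap([1, 1, 1, 1], [1, 0, 1, 1], ["A", "A", "B", "B"])
wer = word_error_rate("start metoprolol 25 mg daily", "start metoprolol 25 mg")
```

Note that a "good" score on one of these says nothing about the other; a scribe with a 2% word error rate can still drop the one word that changes a medication dose.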
Adding holes to the Swiss cheese
Evaluating AI systems that affect patient care shouldn't stop at demonstrating that the systems do what they say they're going to do; they also need to show that they're not introducing additional risks in the process. New risks need to be actively assessed, from clinician-AI interaction (more cognitive errors when the AI reduces the cognitive energy a clinician spends actually thinking) to connectivity failures on mobile devices.
What would help hospital leaders evaluate AI systems?
Benchmarks
For any given healthcare AI product, there are likely multiple aspects to safety evaluation, most of which lack public benchmarks. This means the data hospital leaders receive is prone to cherry-picked metrics. It's also hard to compare similar healthcare AI products, since each company will show its best performance on the tasks its system is good at. A standard benchmark, analogous to what the MMLU provides for foundation models, would be enormously helpful for distinguishing between similar products.
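As a toy illustration of why a shared benchmark matters, imagine every vendor's model being scored on the same fixed case set with the same scoring function. The cases, labels, and vendor models below are entirely made up.

```python
# Toy illustration: one shared case set, one scoring function, every vendor.
# The cases, labels, and vendor models are all hypothetical.

BENCHMARK_CASES = [
    {"note": "chest pain, diaphoresis, ST elevation on ECG", "label": "STEMI"},
    {"note": "fever, productive cough, infiltrate on CXR", "label": "pneumonia"},
    {"note": "acute flank pain radiating to groin, hematuria", "label": "nephrolithiasis"},
]

def benchmark_accuracy(predict, cases=BENCHMARK_CASES):
    """Fraction of the shared cases a vendor's predict(note) gets right."""
    return sum(predict(c["note"]) == c["label"] for c in cases) / len(cases)

# Stand-ins for two vendors' diagnostic models:
def vendor_a(note):
    return "STEMI" if "ST elevation" in note else "pneumonia"

def vendor_b(note):
    return "pneumonia"

for name, model in [("Vendor A", vendor_a), ("Vendor B", vendor_b)]:
    print(f"{name}: {benchmark_accuracy(model):.2f}")
```

The point isn't the accuracy numbers themselves; it's that both vendors were scored on cases neither of them chose, which is exactly what cherry-picked vendor decks don't give you.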
Clarity about what aspects need to be evaluated
Moreover, there aren't clear guidelines for which services or features require which kinds of evaluation. One AI system may need evaluations of bias, patient experience, and clinician-AI interaction, while another may need evaluations of diagnostic accuracy and patient re-identification risk. Standardizing categories of evaluation for different types of AI systems would help hospital leaders ask the right kinds of questions.
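Here's a hypothetical sketch of what such standardization could look like in practice: a shared checklist of evaluation categories keyed by product type, so a hospital can immediately see what a vendor has and hasn't reported. The product types and categories are illustrative, not an established standard.

```python
# Hypothetical checklist: required evaluation categories by product type.
# Neither the product types nor the categories are an established standard.

REQUIRED_EVALS = {
    "diagnostic_model": {"diagnostic accuracy", "subgroup bias", "drift monitoring"},
    "ambient_scribe": {"transcription accuracy", "clinician-AI interaction",
                       "privacy / re-identification"},
    "patient_chatbot": {"unsafe-advice rate", "patient experience",
                        "escalation to a human"},
}

def missing_evals(product_type, evals_reported):
    """Which required evaluations has a vendor not yet reported?"""
    required = REQUIRED_EVALS.get(product_type, set())
    return sorted(required - set(evals_reported))

print(missing_evals("ambient_scribe", {"transcription accuracy"}))
# ['clinician-AI interaction', 'privacy / re-identification']
```

Even a simple shared checklist like this would shift the burden of proof: instead of vendors deciding what to show, hospital leaders would know what's missing.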
Summary
The recent change in safety infrastructure at OpenAI demonstrates that even companies founded with safety as a primary value can see that focus shift under market forces. The same shift can easily happen in healthcare AI. Clinicians and hospital systems must be vigilant about evaluating risks in a meaningful, systematic way in order to provide safe care for patients.