Several years ago, I was given a new tool to help me with my clinical work. It was supposed to make my life easier, and I was looking forward to offloading some of the less interesting daily tasks. It had been taught how to summarize a patient’s history, create an assessment and plan, and write orders. I was told this tool would be able to make simple decisions about patient care on its own and know when to ask for help with more complex issues.
The downside was that this tool had some basic knowledge but had never been tested in a real situation before. I had the same kind of tool with me most of the time, but the specific tool changed every few days. This made my work more difficult because the tools varied in performance and had unpredictable knowledge gaps. Part of my job was to train it to be completely autonomous over a period of 3-5 years, which was challenging with the constant rotation and the lack of dedicated time for Reinforcement Learning from Human Feedback. It could also only function for a pre-specified number of hours per week before it shut down, and each one cost about $60,000 per year.
One of the hardest parts of using this tool was that I never knew if it was giving me correct information. Sometimes it would hallucinate information about a patient, though it would deliver the information very convincingly. Other times it would give me correct information in such great detail that I had to cut it off due to time constraints. It was eager to learn but still risky, especially at the beginning; I knew that its high-risk patients were more likely to have worse outcomes when it first started. I was also legally liable for any errors the tool made.
I wanted to take great care of my patients, so I found myself verifying much of the information the tool provided. However, it was impossible from a time perspective to check every detail the tool provided in real time, especially since I often had multiple tools running at once. I had to be selective in deciding which information and decisions to double-check. I had been trained to recognize patterns, and I caught most of the errors this way. For example, I would ask the tool what cardiology said about the lab test and realize the tool had forgotten to consult them. I overheard the tool planning to order a nephrotoxic drug for a patient with severe kidney disease and prompted it to check for drug side effects next time. Many times it generated a note with incorrect or incomplete information that I had to correct.
After years of training these tools, I found that some were excellent, some were average, and a few never reliably functioned without significant prompt engineering.
This tool, of course, is a medical intern.
Note: I hope this doesn’t seem pejorative to medical students or interns; we were all in your shoes at some point and the medical education process is a vital path to training physicians.
Information filtering as a fundamental physician competency
I’ve seen a lot of concern in popular media about physicians using AI tools that don’t have perfect performance.
What if physicians take every AI suggestion without using their clinical judgment?
I think these fears are generally pretty overblown. There are good analogies for much of this AI technology in tools physicians are already very comfortable using: chatbots are like textbooks, and AI diagnostic tools are like interns.
AI chatbots as fancy textbooks
Most AI-based medical chatbots are more akin to textbooks than to revolutionary tools. More recent examples of such cognitive aids include Epocrates in the 2000s, UpToDate, emergency manuals, protocols and care pathways, and the clinical decision support embedded into most EHRs. Physicians are accustomed to taking information from a book or screen, integrating it into their existing knowledge, and making a decision. If AI is indeed a “printing press moment”, then AI continues a long history of physicians successfully using cognitive aids, a history that began when the first medical textbook was made widely available by an actual printing press.
There may be more warranted concern about physicians failing to use their clinical judgment when AI and physicians concurrently perform a task. Some studies demonstrate that AI as a concurrent reader (rather than a second reader) for radiology studies decreases diagnostic accuracy.
However, several studies note that whether a physician decides to take an AI-generated suggestion depends on multiple factors and “a radiologist’s decision how to use the system and whether to include its assessment in a final diagnosis depends largely on the result of the AI system’s classification and is driven by cognitive processes we are only beginning to understand”.
Explainability, whether the AI displays its chain of thought and the confidence of its assessments, and physician trust in the system are all contributing factors. Just like deciding when to accept an intern’s assessment and plan, the decision about when to accept AI input is not always clear, even to the physician.
Why Ronald Reagan is the most-quoted person in the hospital
Whenever physicians dismiss AI suggestions, it seems to be a huge surprise to the non-physician community, which says,
“We are giving you this great tool that has better performance than you do! Why wouldn’t you trust it over your own judgment?”
And part of the answer is that physicians have such a long institutional history of receiving questionable information that we’re taught to rely on our judgment rather than the imperfect tools; we are taught to “trust, but verify”.
We’ve been relying on humans to provide physicians with clinical information and input for millennia, though they are far from perfect. In fact, if you were going to design an AI tool with some aspects of human performance, that product would never make it into the hospital.
Humans have a lot of inherent performance issues
In reality, humans in general are incredibly imperfect tools for many medical tasks:
We are terrible at night and when we’re tired
Sleep-deprived surgeons performing laparoscopic surgery make “20% more errors and take 14% longer to complete the tasks than those who had a full night's sleep.”
A 4-hour sleep loss is equivalent to a 0.095% breath-alcohol concentration
We have a large number of cognitive biases
“Overconfidence, lower tolerance to risk, the anchoring effect, and information and availability biases [are] associated with diagnostic inaccuracies in 36.5 to 77 % of case-scenarios”
We have a large number of biases about specific patient groups including women, people of color, the elderly, and people over a certain BMI
“Providers high in implicit bias [are] less supportive of and [spend] less time with their patients than providers low in implicit bias”
We maximize for personal profit rather than patient well-being
“Reimbursement changes lead physicians to adjust treatment patterns”
Yet we rely on humans and trainees to perform many dangerous tasks in a hospital every day. This interface between humans and AI-generated suggestions is a “vibrant area of research”, and we can expect to see more studies coming out about how to optimize AI platforms for physician interaction. Until then, we’ll rely on the error-prone but well-intentioned human-human interactions that medicine is based on. And we’ll all try not to have a heart attack in July.