Machine Learning for MDs Weekly Digest
The mission of ML for MDs is to connect physicians interested in machine learning. This newsletter provides learnings at the intersection of medicine and machine learning.
Fun Fact
Thomas Bayes was an English clergyman in the 1700s. His “Bayes’ Theorem,” which describes how to update the probability of an event as new evidence becomes available, was published two years after his death.
AI Words of the Week
Accuracy - the proportion of all predictions that were correct
Recall - sensitivity: the proportion of actual positives the model correctly identified
Precision - the proportion of positive predictions that were true positives
F1 score - a measure of accuracy that balances recall and precision
Concept of the Month: AI Performance
This month we’re going to look at measures of AI performance, how they’re used in AI and medical journal articles, as well as in real-life use cases.
How many alerts for drug interactions in your EHR have you scrolled past in the past week? Can you even count them? How many of them did you deeply consider? How many did you even read? If you’re like most doctors, you ignored 49-96% of computerized safety notifications.
These safety notifications are ostensibly a good concept, but aren’t nearly as useful as they could be. Physicians have long been used to imperfect tools, and to using their judgment in deciding which of these tools to prioritize.
For example, many of the commonly used scoring systems have pretty mediocre sensitivity and specificity, even for clinical situations with life and death consequences:
The NEWS2 framework for escalation of care in a hospital has a sensitivity of 0.82 and a specificity of 0.67.
The Mallampati score for prediction of a difficult airway has a sensitivity of 40-80% and specificity of 50-85%, with a positive predictive value of 5-20%.
In fact, EKGs misinterpreted by computers are estimated to cause 10,000 adverse events or deaths a year worldwide. Computer interpretation is particularly poor for rhythm disorders:
Positive predictive accuracy for non-sinus rhythms of 53%
Atrial fibrillation was misdiagnosed 11% of the time
They’re also not great at STEMI diagnosis: one study found a sensitivity of 62-69% and a specificity of 89-95%. Diagnosis of non-STEMI occlusive myocardial infarction is also notoriously poor.
The introduction of AI into a physician’s bag of tools is no different. The performance of AI tools will likely be much better than previous computer-aided diagnostic systems, but they won’t be perfect, and physicians won’t expect them to be.
But with the hype around AI and the increased complexity of the algorithms, it’s imperative for physicians to understand how AI performance is calculated and reported.
Measuring the performance of an AI system relies heavily on knowing the “right,” or ideal, answer. That’s a lot easier with the straightforward questions AI algorithms are often tested on:
Is this image a cat or dog?
Is this the correct answer to a math problem?
Is this the correct translation from English to Spanish?
Many questions in medicine, though, are not nearly as straightforward:
Was the clinical decision support useful to the clinician?
Did the chatbot respond in a culturally appropriate manner?
Which patients should be prioritized for extra resources?
Common measures of AI performance
Performance of AI tools is (generally) reported differently from the sensitivity, specificity, PPV, and NPV physicians are used to seeing. The most common AI performance metrics are:
Accuracy
Recall
Precision
F1 score
Area Under the Receiver Operating Characteristic curve (AUC, or AUROC) is often also reported. It shows how well the model can distinguish between classes, like patients with disease vs patients without disease. The ROC curve plots sensitivity on the Y-axis against (1 - specificity) on the X-axis.
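To make this concrete, here’s a minimal Python sketch (made-up labels and risk scores, purely illustrative) of what AUC actually measures; in practice you’d use a library function such as scikit-learn’s roc_auc_score.

# AUC = the probability that a randomly chosen diseased patient gets a higher
# risk score than a randomly chosen non-diseased patient.
# The labels and scores below are hypothetical.
labels = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = disease, 0 = no disease
scores = [0.9, 0.2, 0.65, 0.4, 0.3, 0.55, 0.8, 0.1]  # model's predicted risk of disease

pos = [s for s, y in zip(scores, labels) if y == 1]  # scores for diseased patients
neg = [s for s, y in zip(scores, labels) if y == 0]  # scores for non-diseased patients
pairs = [(p, n) for p in pos for n in neg]           # every diseased/non-diseased pairing
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print(f"AUC: {auc:.2f}")                             # 1.0 = perfect separation, 0.5 = chance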
Before you can calculate these scores, keep in mind that AI performance is usually measured against key performance indicators (KPIs): the main tasks you want the AI to accomplish. These can be anything - how often an autonomous car crosses the white line, or how many questions were answered correctly. In medicine, KPIs can be harder to define, since some disease states are clinical diagnoses, and the key performance indicator may be based on a physician’s review of the chart, which has its own error rate.
The Confusion Matrix
I’m not sure I’ve ever encountered a more appropriate name for a statistical tool. This is the 2x2 matrix you came to know and love in medical school:
                        Actual positive      Actual negative
Predicted positive      True positive        False positive
Predicted negative      False negative       True negative

True positive: has cancer, test says has cancer
False positive: doesn’t have cancer, test says has cancer
False negative: has cancer, test says doesn’t have cancer
True negative: doesn’t have cancer, test says doesn’t have cancer
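If you like seeing the counting spelled out, here’s a minimal Python sketch (hypothetical labels, where 1 = has cancer and 0 = doesn’t) that tallies the four cells of the confusion matrix:

# Tally the four confusion-matrix cells by hand from hypothetical data.
actual    = [1, 1, 0, 0, 1, 0, 0, 1]  # ground truth: 1 = has cancer, 0 = doesn't
predicted = [1, 0, 0, 1, 1, 0, 0, 0]  # model's predictions

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")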
Accuracy
Accuracy is how many correct predictions your model made: the number of “correct answers” divided by all the answers.
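In confusion matrix terms: Accuracy = (TP+TN)/(TP+TN+FP+FN)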
In practice, accuracy is the most commonly measured metric in AI models. But accuracy can be misleading, because:
It can invite metric hacking: the model can be tested in “easy” circumstances, which inflates the reported number
The severity of mistakes isn’t captured by accuracy, so a potentially life-threatening error counts the same as a very minor one
In imbalanced data sets (where the “positive” case, like cancer, is much less common than the negative case), a model can have high overall accuracy while still missing most of the cases you actually care about - the cancers (see the sketch below)
These issues will lead to a very accurate model being significantly less useful in the real world than expected.
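To see the imbalanced-data problem in action, here’s a minimal sketch with made-up numbers: 1,000 patients, 10 of whom have cancer, and a “model” that simply calls everyone negative.

# A "model" that never diagnoses cancer looks 99% accurate but misses every case.
actual    = [1] * 10 + [0] * 990  # hypothetical: 10 patients with cancer, 990 without
predicted = [0] * 1000            # the model predicts "no cancer" for every patient

accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
recall   = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1) / sum(actual)
print(f"Accuracy: {accuracy:.2f}")  # 0.99 -- looks excellent
print(f"Recall:   {recall:.2f}")    # 0.00 -- every cancer is missed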
Useful questions to consider when you hear the accuracy of an AI model
What’s the benchmark/what metric(s) make this model’s outcomes count as “correct”?
Is that benchmark something that’s clinically important?
How subjective is that benchmark?
Is there information being reported on the potential shortcomings of that benchmark?
Is there any metric hacking?
What kinds of tasks was it tested on? Are those representative of real practice?
When errors occur, what is the severity/possible consequence of the incorrect answer?
Is there an alternative or complementary way to measure this model that might give a clearer picture of its performance?
What information would make the performance metric of the model more meaningful?
Recall
Great news about recall - it’s the same as sensitivity, so it’s something you’re used to looking at. Why does it have a different and more confusing name? No idea.
You’ll recall (ha!) that sensitivity measures how many true positives (TP) are predicted out of all the actual positives. Recall is the fraction of patients who were correctly diagnosed with a disease among all patients who actually had that disease.
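In confusion matrix terms: Recall = TP/(TP+FN)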
Precision
Precision is the proportion of true positives among all positive results - the fraction of patients who were correctly diagnosed with a disease among all patients who were diagnosed with that disease. With a high-precision model, you can be confident that a sample predicted as positive really is positive. Note: it’s the same thing as positive predictive value (PPV). Precision is closely related to recall/sensitivity, and there’s usually a trade-off: precision generally goes up as recall goes down.
Precision = TP/(TP+FP)
F1 score
The F1 score is the “harmonic mean” of precision and recall - a single measure of accuracy that balances the two. High is good, low is bad.
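F1 = 2 × (Precision × Recall)/(Precision + Recall). For example, a (hypothetical) model with a precision of 0.80 and a recall of 0.60 has an F1 of 2 × (0.80 × 0.60)/(0.80 + 0.60) ≈ 0.69.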
Next week we’ll look through some examples of papers to help solidify these concepts and help us evaluate some of the new AI models.
Community News
If you haven’t introduced yourself, please do so under the #intros channel.
Thanks for being a part of this community! As always, please let me know if you have questions/ideas/feedback.
Sarah
Sarah Gebauer, MD