What Does ‘Good’ Look Like? A Rubric for Healthcare AI Governance
The Life-Changing Magic of Knowing What You're Looking For
I was shocked the first time I was given a grading rubric in my freshman English composition course. It felt like cheating—you’re actually going to tell me how to get a good grade on a paper? I couldn’t believe it.
I’d gone to a not-so-great high school, where getting an A on a paper involved a complex alchemy of guessing what the teacher wanted, choosing a vaguely reasonable topic, and badly imitating writers I admired. (Because this was not a great school, I still got As.)
Having a rubric was a lifeline when I joined my more-prepared classmates in undergrad—many of whom had gone to prep schools where someone had presumably taught them how to write. It made my knowledge of the teacher less relevant, which was a relief, and it leveled the playing field. I actually learned how to write by using those rubrics.
Rubrics: From red ink to standardized fairness
The word “rubric” comes from the Latin rubrica, meaning red ink. Some of the earliest versions were actually inspired by wine-tasting assessments.
Did you know that wine-tasting notes distinguish between black olives and green olives, and even list artichoke as a possible flavor?
In education, rubrics gained traction in the 1990s alongside the push for standardization that culminated in the No Child Left Behind Act.
By showing what “good” looks like, rubrics made grading fairer and more teachable. And in my case, they made writing not just less mysterious—but actually doable.
The Need for Rubrics in Healthcare AI: We’re asking questions, but we’re not describing what a good answer looks like
This same issue exists right now in healthcare AI. We’re asking vendors to respond to governance questionnaires before implementation—but we don’t have a rubric for evaluating their answers.
Without clear guidance on what counts as a “good” response, vendors are left guessing. Their responses can vary widely, even within the same company, depending on who’s filling out the form. Governance committees, meanwhile, often end up debating whether an answer is acceptable—without shared standards. Favoritism creeps in. So does inefficiency.
Just like in the classroom, the absence of a rubric makes it harder to be fair, and harder to get better.
Medicine Uses Rubrics. Why Doesn’t AI Governance?
Here’s the kicker: we already know how to do this!
It’s not like medicine doesn’t use rubrics. The OSCE (Objective Structured Clinical Examination), which every physician who has entered practice since the early 2000s has gone through, has a detailed scoring rubric. The template at Texas Tech Med even uses the Latin-inspired red ink.
Some institutions are now using AI to grade the written portion of the OSCE. And similarly, we could use a rubric to evaluate vendor responses to AI governance questionnaires. A well-structured rubric would reduce the burden on governance committees, support internal alignment, and encourage vendors to submit higher-quality responses from the start.
What Comes After the Rubric? Pass/Fail Logic and Risk Thresholds
Once rubric scores are in hand, we need a clear sense of what counts as passing. Is a tool acceptable if it scores above a certain threshold within a category like Bias and Fairness? Or does each question need to meet a minimum bar?
This isn't just a grading exercise—it's a risk triage system. If we treat it that way, we can make better decisions, faster, and with more confidence.
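To make that choice concrete, here is a minimal Python sketch of the two pass/fail approaches. The question names, scores, function names, and thresholds are hypothetical placeholders, not values from the rubric itself.

from statistics import mean

# Hypothetical 0-3 scores for three questions in the Bias and Fairness category.
bias_fairness_scores = {
    "subgroup_performance": 3,
    "mitigation_strategy": 2,
    "demographic_data_validation": 1,
}

def passes_category_threshold(scores, threshold=2.0):
    # Approach 1: the category passes if its average score clears a threshold.
    return mean(scores.values()) >= threshold

def passes_per_question_minimum(scores, minimum=2):
    # Approach 2: every individual question must meet a minimum bar.
    return all(score >= minimum for score in scores.values())

print(passes_category_threshold(bias_fairness_scores))    # average is 2.0, so this passes
print(passes_per_question_minimum(bias_fairness_scores))  # one question scored 1, so this fails

Note that the same set of scores clears the category average but fails the per-question minimum; deciding which behavior you want is exactly the risk-triage call described above.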
Introducing the Validara AI Governance Scoring Rubric
Translating thoughtful technical work into a governance response that resonates with clinicians, regulators, or oversight teams isn’t easy.
Most governance questions require a combination of clinical insight, operational context, and careful documentation—not just technical accuracy. But we’ve had no shared format for how to communicate that clearly.
In medicine, we don’t rely on gut feeling to assess the readiness of a clinician to “deploy” into a patient care setting. We use structured evaluations. Why should AI governance be any different?
So I built the rubric I wish we had.
This governance rubric is designed to bring structure and transparency to the messy process of evaluating AI tools for clinical deployment. It doesn’t reinvent the wheel. It draws from existing standards like CHAI, NIST, and ISO, but turns them into something usable. Something that helps you distinguish between “we care about fairness” and “here are the results for subgroup performance.”
The rubric includes common governance questions, as well as sample weak, moderate, and strong answers for each question; a sketch of how a single entry might be encoded follows the domain list below.
The rubric covers over 40 questions in the following domains:
Product Overview
What the tool does, how it fits into the workflow, and why AI is being used at all.
Clinical
Accuracy, risk handling, clinician oversight, explainability, edge cases, and clinical evolution.
Bias and Fairness
Subgroup performance, mitigation strategies, and how demographic data is used and validated.
For Patient-Facing Tools
Privacy, consent, patient messaging, and alignment with accessibility standards.
Healthcare User Experience
Training, usability, override pathways, and integration with existing systems.
Governance and Frameworks
Alignment with external standards (e.g., CHAI, NIST), feedback mechanisms, and ongoing monitoring.
Technical
Version control, data lineage, model testing, update processes, and incident response.
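For teams that want to work with the rubric programmatically, here is one hypothetical way a single entry could be encoded in Python. The class, field names, and example text are illustrative assumptions, not the rubric's actual wording.

from dataclasses import dataclass, field

@dataclass
class RubricQuestion:
    domain: str        # e.g., "Bias and Fairness"
    question: str      # the governance question posed to the vendor
    score_anchors: dict = field(default_factory=dict)  # descriptions for scores 0-3

example_entry = RubricQuestion(
    domain="Bias and Fairness",
    question="How does the tool perform across demographic subgroups?",
    score_anchors={
        0: "Not mentioned: the response doesn't address the question",
        1: "Needs revision: vague or generic, no subgroup results provided",
        2: "Adequate: subgroup results reported, but mitigation plan is thin",
        3: "Strong: subgroup results reported with a documented mitigation plan",
    },
)

print(example_entry.domain, "-", example_entry.question)

Keeping the question text and its score anchors in one structure makes it easier to hold reviewers to the same definitions of weak, moderate, and strong.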
How to use the rubric
For Vendors
AI vendors can use the rubric to:
Internally review their own governance materials before submitting them to a health system or investor. This helps teams catch vague language or missing components, and could shorten the implementation cycle.
Iterate on aspects of the product during research and development cycles.
For Health Systems
Health systems can use the rubric to:
Consistently evaluate AI tools and vendor responses.
Clarify internal areas of priority, such as which scores on which questions constitute a “hard stop” prior to implementation.
Train new AI governance committee members.
Scoring and Decision-Making
Each response is scored on a simple 0–3 scale:
0 – Not mentioned: The response doesn’t address the question
1 – Needs Revision: Vague, generic, or missing key components
2 – Adequate: Covers most requirements but could be clearer or more complete
3 – Strong: Clear, specific, and demonstrates alignment with best practices
Some organizations may choose to:
Use cumulative scores across each domain to flag overall readiness
Require a minimum score in key areas like Clinical Risk or Bias and Fairness
Weight domains differently depending on risk tolerance or tool use case (e.g., higher standards for patient-facing tools), as sketched below
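As a rough illustration of how those options might combine, here is a minimal Python sketch of weighted domain scores with hard stops. The domain weights, scores, and thresholds are hypothetical and would need to be set by each organization.

# Hypothetical per-domain average scores (0-3) for one vendor response.
domain_scores = {
    "Clinical": 2.6,
    "Bias and Fairness": 1.8,
    "Technical": 2.4,
    "Healthcare User Experience": 2.1,
}

# Hypothetical weights; a patient-facing tool might shift more weight
# onto Clinical and Bias and Fairness.
weights = {
    "Clinical": 0.35,
    "Bias and Fairness": 0.30,
    "Technical": 0.20,
    "Healthcare User Experience": 0.15,
}

# Hard stops: minimum average required in key domains, regardless of the overall score.
hard_stops = {"Clinical": 2.0, "Bias and Fairness": 2.0}

weighted_total = sum(domain_scores[d] * weights[d] for d in domain_scores)
failed_stops = [d for d, floor in hard_stops.items() if domain_scores[d] < floor]

print(f"Weighted readiness score: {weighted_total:.2f} out of 3.00")
if failed_stops:
    print("Hard stop triggered in:", ", ".join(failed_stops))
else:
    print("All minimum thresholds met; ready for committee review")

In this example the weighted total looks respectable, but the Bias and Fairness hard stop still blocks the tool, which is the behavior you would want for a high-risk domain.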
My goal is to make evaluation easier, clearer, and more aligned with real-world safety and clinical integrity.
Because once we start using rubrics, we can stop wasting time debating whether an answer “feels good enough” and start asking the better question: Does this meet our threshold for safe, equitable, clinically sound AI?
Get in touch on the Validara Health website about how the full rubric can help you streamline your product cycle, de-risk clinical deployment, and get your team aligned on what "ready" really means.
What’s Next: Making the Frameworks Work for Us
Of course, rubrics don’t exist in a vacuum. They’re only as good as the questions they’re scoring. And right now, those questions are scattered across dozens of frameworks—some long, some confusing, and many overlapping without alignment.
In my next post, I’ll walk through three of the most widely used governance frameworks (NIST RMF, CHAI’s Responsible AI Guide, and RAISE 3) and compare what they cover (and what they miss).
Subscribe to get the next post as soon as it’s out.