
The promise and perils of AI in health care: Why we need better testing standards

Max Rollwage, PhD
Tech
July 3, 2025

There is substantial enthusiasm around advances in AI for health care, as exemplified by the media attention for the new HealthBench release from OpenAI and similar recent studies from Google (Med-PaLM 2, AMIE). Market enthusiasm often leads us to believe that we’re on the brink of AI doctors treating patients worldwide. However, while these developments represent interesting technical progress, they fall short of demonstrating readiness for clinical application.

The stakes in health care AI are extraordinarily high. When individuals seek AI guidance for urgent health concerns—like a baby struggling to breathe or a grandparent showing stroke symptoms—the advice must be unequivocally safe and accurate. Unfortunately, current methods for testing AI’s clinical readiness are often insufficient and circular.

These “breakthrough” studies typically face several critical limitations:

  • They test AI performance on artificial or simulated patient cases rather than real patient interactions.
  • They evaluate responses using automated AI evaluations instead of human expert assessment.
  • They lack proper evaluation of patient outcomes from clinical AI interactions.

Take HealthBench, for example, which uses 5,000 handcrafted scenarios to test AI clinical agents. While this represents progress toward wider coverage of test scenarios, these artificial scenarios likely fail to capture the true complexity of real-world patient presentations. Furthermore, when companies create their own testing scenarios, it’s impossible to verify whether these cases truly represent the full spectrum of medical situations or inadvertently favor their models’ capabilities.

Perhaps more fundamentally, benchmarks like HealthBench often use AI to evaluate the clinical appropriateness of other AI responses. This creates a problematic circular logic: We’re using AI to validate AI’s fitness for clinical use, essentially trusting its evaluative capabilities before proving their safety in high-stakes environments. At this stage, only human expert evaluations can provide appropriate ground truth for clinical performance. It’s important to ask the question: Are leading AI developers applying the necessary rigor in this crucial evaluation process?

The final test of any clinical tool lies in its impact on patient outcomes. This requires rigorous clinical trials that track how patients fare when the tool is used in their care—particularly their long-term recovery and health outcomes.

The current approach for clinical AI agents is akin to claiming a new drug is safe based solely on computational models of its molecular interactions (e.g., predictions from a tool like AlphaFold), without conducting comprehensive clinical trials. Just as drug development demands rigorous human testing to prove real-world safety and efficacy, AI intended for clinical use requires far more than AI-driven simulations.

Safe deployment of clinical AI agents requires substantially improved testing frameworks and will likely take more time and effort than leading AI labs anticipate.

To truly safeguard patients and build trust, we must fundamentally elevate AI testing standards through:

  • Real-user interactions: Testing models with genuine clinical presentations from actual users.
  • Expert human evaluation: Having qualified clinicians assess the quality, safety, and appropriateness of AI responses.
  • Impact assessment: Conducting experimental, randomized studies to evaluate the tangible impact of AI interactions on user understanding, decisions, and well-being.

Some players in the space do take rigorous testing of clinical AI very seriously and are making substantial progress. Government organizations such as the FDA and the U.S. and U.K. AI Safety Institutes are tasked with creating guidelines for how to test the safety of AI in clinical applications. For example, the FDA’s recommendations for testing AI for clinical fitness [1] differ starkly from what large leading labs claim to be sufficient. At the same time, the U.S. and U.K. AI Safety Institutes and applied clinical AI companies are working to improve the validity of clinical AI testing by creating more appropriate benchmarks and by studying how large language models actually affect user well-being when used for medical purposes.

AI is still in its infancy, and only by embracing stringent, real-world testing can clinical AI mature responsibly. As with drugs, medical devices, and other clinically impactful interventions, AI needs to be thoroughly tested before it is allowed to stand alone in a clinical setting.

It’s the only path to developing AI models that are genuinely safe, effective, and beneficial for patient care, moving beyond theoretical benchmarks created by technologists to proven clinical utility administered by clinicians.

Max Rollwage is a health care executive.

Tagged as: Health IT

All Content © KevinMD, LLC