KevinMD

The promise and perils of AI in health care: Why we need better testing standards

Max Rollwage, PhD
Tech
July 3, 2025

There is substantial enthusiasm about advances of AI in health care, as exemplified by the media attention for the new HealthBench release from OpenAI and similar recent studies from Google (Med-PaLM 2, AMIE). Market enthusiasm often leads us to believe that we’re on the brink of AI doctors treating patients worldwide. However, while these developments represent interesting technical progress, they fall short of demonstrating readiness for clinical application.

The stakes in health care AI are extraordinarily high. When individuals seek AI guidance for urgent health concerns—like a baby struggling to breathe or a grandparent showing stroke symptoms—the advice must be unequivocally safe and accurate. Unfortunately, current methods for testing AI’s clinical readiness are often insufficient and circular.

These “breakthrough” studies typically face several critical limitations:

  • They test AI performance on artificial or simulated patient cases rather than real patient interactions.
  • They evaluate responses using automated AI evaluations instead of human expert assessment.
  • They lack proper evaluation of patient outcomes from clinical AI interactions.

Take HealthBench, for example, which uses 5,000 handcrafted scenarios to test AI clinical agents. While this represents progress toward wider coverage of test scenarios, these artificial scenarios likely fail to capture the true complexity of real-world patient presentations. Furthermore, when companies create their own testing scenarios, it’s impossible to verify whether these cases truly represent the full spectrum of medical situations or inadvertently favor their models’ capabilities.

Perhaps more fundamentally, benchmarks like HealthBench often use AI to evaluate the clinical appropriateness of other AI responses. This creates a problematic circular logic: We’re using AI to validate AI’s fitness for clinical use, essentially trusting its evaluative capabilities before proving their safety in high-stakes environments. At this stage, only human expert evaluations can provide appropriate ground truth for clinical performance. It’s important to ask the question: Are leading AI developers applying the necessary rigor in this crucial evaluation process?
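One way to make the circularity concern concrete is to check how often an automated grader agrees with clinicians on the same responses before relying on it. The sketch below is illustrative only: the function name, the labels, and the ratings are made up for this example and do not come from HealthBench or any other benchmark. It computes Cohen's kappa, a standard chance-corrected agreement statistic, between a hypothetical AI grader and a hypothetical clinician.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement if each rater labeled at random with their own label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of the same four AI responses (invented data, for illustration).
ai_grader = ["safe", "safe", "unsafe", "unsafe"]
clinician = ["safe", "unsafe", "unsafe", "unsafe"]

print(cohens_kappa(ai_grader, clinician))
```

A kappa near 1 would suggest the automated grader tracks clinician judgment on that sample, while a low value would suggest it cannot stand in for expert review. Even high agreement on a small or simulated sample would not, by itself, establish safety in real clinical use.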

The final test of any clinical tool lies in its impact on patient outcomes. This requires rigorous clinical trials that track how patients fare when the tool is used in their care—particularly their long-term recovery and health outcomes.

The current approach for clinical AI agents is akin to claiming a new drug is safe based solely on computational models of its molecular interactions (e.g., AlphaFold), without conducting comprehensive clinical trials. Just as drug development demands rigorous human testing to prove real-world safety and efficacy, AI intended for clinical use requires far more than AI-driven simulations.

The steps toward safe deployment of clinical AI agents require substantially improved testing frameworks and are likely to require more time and effort than anticipated by leading AI labs.

To truly safeguard patients and build trust, we must fundamentally elevate AI testing standards through:

  • Real-user interactions: Testing models with genuine clinical presentations from actual users.
  • Expert human evaluation: Having qualified clinicians assess the quality, safety, and appropriateness of AI responses.
  • Impact assessment: Conducting experimental, randomized studies to evaluate the tangible impact of AI interactions on user understanding, decisions, and well-being.

Some players in the space do take rigorous testing of clinical AI seriously and are making substantial progress. Government organizations such as the FDA and the U.S. and U.K. AI Safety Institutes are tasked with creating guidelines for testing the safety of AI in clinical applications. For example, the FDA’s recommendations for testing AI for clinical fitness [1] differ starkly from what large leading labs claim to be sufficient. At the same time, the AI Safety Institutes and applied clinical AI companies are working to improve the validity of clinical AI testing by creating more appropriate benchmarks and by studying how large language models actually affect user well-being when used for medical purposes.

AI is still in its infancy, and only by embracing stringent, real-world testing can clinical AI mature responsibly. As with drugs, therapeutics, medical devices, and other clinical interventions, AI needs to be thoroughly tested before it is allowed to stand alone in a clinical setting.

It’s the only path to developing AI models that are genuinely safe, effective, and beneficial for patient care, moving beyond theoretical benchmarks created by technologists to proven clinical utility administered by clinicians.

Max Rollwage is a health care executive.

Tagged as: Health IT
