
The promise and perils of AI in health care: Why we need better testing standards

Max Rollwage, PhD
Tech
July 3, 2025

There is substantial enthusiasm around advances in AI for health care, as exemplified by the media attention for the new HealthBench release from OpenAI and similar recent studies from Google (Med-PaLM 2, AMIE). Market enthusiasm often leads us to believe that we’re on the brink of AI doctors treating patients worldwide. However, while these developments represent interesting technical progress, they fall short of demonstrating readiness for clinical application.

The stakes in health care AI are extraordinarily high. When individuals seek AI guidance for urgent health concerns—like a baby struggling to breathe or a grandparent showing stroke symptoms—the advice must be unequivocally safe and accurate. Unfortunately, current methods for testing AI’s clinical readiness are often insufficient and circular.

These “breakthrough” studies typically face several critical limitations:

  • They test AI performance on artificial or simulated patient cases rather than real patient interactions.
  • They evaluate responses using automated AI evaluations instead of human expert assessment.
  • They lack proper evaluation of patient outcomes from clinical AI interactions.

Take HealthBench, for example, which uses 5,000 handcrafted scenarios to test AI clinical agents. While this represents progress toward wider coverage of test scenarios, these artificial scenarios likely fail to capture the true complexity of real-world patient presentations. Furthermore, when companies create their own testing scenarios, it’s impossible to verify whether these cases truly represent the full spectrum of medical situations or inadvertently favor their models’ capabilities.

Perhaps more fundamentally, benchmarks like HealthBench often use AI to evaluate the clinical appropriateness of other AI responses. This creates a problematic circular logic: We’re using AI to validate AI’s fitness for clinical use, essentially trusting its evaluative capabilities before proving their safety in high-stakes environments. At this stage, only human expert evaluations can provide appropriate ground truth for clinical performance. It’s important to ask the question: Are leading AI developers applying the necessary rigor in this crucial evaluation process?

The final test of any clinical tool lies in its impact on patient outcomes. This requires rigorous clinical trials that track how patients fare when the tool is used in their care—particularly their long-term recovery and health outcomes.

The current approach for clinical AI agents is akin to claiming a new drug is safe based solely on computational models of its molecular interactions (e.g., AlphaFold), without conducting comprehensive clinical trials. Just as drug development demands rigorous human testing to prove real-world safety and efficacy, AI intended for clinical use requires far more than AI-driven simulations.

Safe deployment of clinical AI agents requires substantially improved testing frameworks and will likely take more time and effort than leading AI labs anticipate.

To truly safeguard patients and build trust, we must fundamentally elevate AI testing standards through:

  • Real-user interactions: Testing models with genuine clinical presentations from actual users.
  • Expert human evaluation: Having qualified clinicians assess the quality, safety, and appropriateness of AI responses.
  • Impact assessment: Conducting experimental, randomized studies to evaluate the tangible impact of AI interactions on user understanding, decisions, and well-being.

Some players in the space do take rigorous testing of clinical AI seriously and are making substantial progress. Government organizations such as the FDA and the U.S. and U.K. AI Safety Institutes are tasked with creating guidelines for testing the safety of AI in clinical applications. For example, the FDA’s recommendations for testing AI for clinical fitness [1] differ starkly from what large leading labs claim to be sufficient. At the same time, the AI Safety Institutes and applied clinical AI companies are working to improve the validity of clinical AI testing by creating more appropriate benchmarks and studying how large language models actually affect user well-being when used for medical purposes.

AI is still in its infancy, and only by embracing stringent, real-world testing can clinical AI mature responsibly. As with drug and therapeutics development, medical devices, and other clinically impactful options, AI needs to be thoroughly tested before it is allowed to stand alone in a clinical setting.

It’s the only path to developing AI models that are genuinely safe, effective, and beneficial for patient care, moving beyond theoretical benchmarks created by technologists to proven clinical utility administered by clinicians.

Max Rollwage is a health care executive.

Tagged as: Health IT
