The promise and perils of AI in health care: Why we need better testing standards

Max Rollwage, PhD
Tech
July 3, 2025

There is substantial enthusiasm around advances in AI in health care, as exemplified by the media attention given to OpenAI’s new HealthBench release and similar recent studies from Google (Med-PaLM 2, AMIE). Market enthusiasm often leads us to believe that we’re on the brink of AI doctors treating patients worldwide. However, while these developments represent interesting technical progress, they fall short of demonstrating readiness for clinical application.

The stakes in health care AI are extraordinarily high. When individuals seek AI guidance for urgent health concerns—like a baby struggling to breathe or a grandparent showing stroke symptoms—the advice must be unequivocally safe and accurate. Unfortunately, current methods for testing AI’s clinical readiness are often insufficient and circular.

These “breakthrough” studies typically face several critical limitations:

  • They test AI performance on artificial or simulated patient cases rather than real patient interactions.
  • They evaluate responses using automated AI evaluations instead of human expert assessment.
  • They lack proper evaluation of patient outcomes from clinical AI interactions.

Take HealthBench, for example, which uses 5,000 handcrafted scenarios to test AI clinical agents. While this represents progress toward wider coverage of test scenarios, these artificial scenarios likely fail to capture the true complexity of real-world patient presentations. Furthermore, when companies create their own testing scenarios, it’s impossible to verify whether these cases truly represent the full spectrum of medical situations or inadvertently favor their models’ capabilities.

Perhaps more fundamentally, benchmarks like HealthBench often use AI to evaluate the clinical appropriateness of other AI responses. This creates a problematic circular logic: We’re using AI to validate AI’s fitness for clinical use, essentially trusting its evaluative capabilities before proving their safety in high-stakes environments. At this stage, only human expert evaluations can provide appropriate ground truth for clinical performance. It’s important to ask the question: Are leading AI developers applying the necessary rigor in this crucial evaluation process?
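One way to make the circularity concern concrete is to check how often an automated grader agrees with clinicians on the same responses before relying on it. The sketch below is illustrative only: it computes Cohen's kappa (chance-corrected agreement) between two sets of labels. The label names and the example ratings are hypothetical and are not taken from HealthBench or any other benchmark.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater labeled at random with their own label rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four AI responses (assumed labels, for illustration only).
clinician_labels = ["safe", "safe", "unsafe", "unsafe"]
ai_grader_labels = ["safe", "safe", "unsafe", "safe"]

kappa = cohens_kappa(clinician_labels, ai_grader_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```

A low kappa would suggest the automated grader cannot stand in for expert review. A high kappa on a small sample would still not establish safety, since the disagreements may cluster in exactly the high-stakes cases.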

The final test of any clinical tool lies in its impact on patient outcomes. This requires rigorous clinical trials that track how patients fare when the tool is used in their care—particularly their long-term recovery and health outcomes.

The current approach for clinical AI agents is akin to claiming a new drug is safe based solely on computational models of its molecular interactions (e.g., AlphaFold), without conducting comprehensive clinical trials. Just as drug development demands rigorous human testing to prove real-world safety and efficacy, AI intended for clinical use requires far more than AI-driven simulations.

The steps toward safe deployment of clinical AI agents require substantially improved testing frameworks and are likely to require more time and effort than anticipated by leading AI labs.

To truly safeguard patients and build trust, we must fundamentally elevate AI testing standards through:

  • Real-user interactions: Testing models with genuine clinical presentations from actual users.
  • Expert human evaluation: Having qualified clinicians assess the quality, safety, and appropriateness of AI responses.
  • Impact assessment: Conducting experimental, randomized studies to evaluate the tangible impact of AI interactions on user understanding, decisions, and well-being.

Some players in this space, however, take rigorous testing of clinical AI very seriously and are making substantial progress. Government bodies such as the FDA and the U.S. and U.K. AI Safety Institutes are tasked with creating guidelines for testing the safety of AI in clinical applications. For example, the FDA’s guidelines on testing AI for clinical fitness [1] differ starkly from what large leading labs claim to be sufficient. At the same time, the U.S. and U.K. AI Safety Institutes and applied clinical AI companies are working to improve the validity of clinical AI testing by creating more appropriate benchmarks and by studying how large language models actually affect user well-being when used for medical purposes.

AI is still in its infancy, and only by embracing stringent, real-world testing can clinical AI mature responsibly. As with drugs and therapeutics, medical devices, and other clinically impactful interventions, AI needs to be thoroughly tested before it is allowed to stand alone in a clinical setting.

It’s the only path to developing AI models that are genuinely safe, effective, and beneficial for patient care, moving beyond theoretical benchmarks created by technologists to proven clinical utility administered by clinicians.

Max Rollwage is a health care executive.


Tagged as: Health IT
