The promise and perils of AI in health care: Why we need better testing standards

Max Rollwage, PhD
Tech
July 3, 2025

There is substantial enthusiasm around the advancement of AI in health care, as exemplified by the media attention given to OpenAI’s new HealthBench release and to similar recent studies from Google (Med-PaLM 2, AMIE). Market enthusiasm often leads us to believe that we’re on the brink of AI doctors treating patients worldwide. However, while these developments represent interesting technical progress, they fall short of demonstrating readiness for clinical application.

The stakes in health care AI are extraordinarily high. When individuals seek AI guidance for urgent health concerns—like a baby struggling to breathe or a grandparent showing stroke symptoms—the advice must be unequivocally safe and accurate. Unfortunately, current methods for testing AI’s clinical readiness are often insufficient and circular.

These “breakthrough” studies typically face several critical limitations:

  • They test AI performance on artificial or simulated patient cases rather than real patient interactions.
  • They evaluate responses using automated AI evaluations instead of human expert assessment.
  • They lack proper evaluation of patient outcomes from clinical AI interactions.

Take HealthBench, for example, which uses 5,000 handcrafted scenarios to test AI clinical agents. While this represents progress toward wider coverage of test scenarios, these artificial scenarios likely fail to capture the true complexity of real-world patient presentations. Furthermore, when companies create their own testing scenarios, it’s impossible to verify whether these cases truly represent the full spectrum of medical situations or inadvertently favor their models’ capabilities.

Perhaps more fundamentally, benchmarks like HealthBench often use AI to evaluate the clinical appropriateness of other AI responses. This creates a problematic circular logic: We’re using AI to validate AI’s fitness for clinical use, essentially trusting its evaluative capabilities before proving their safety in high-stakes environments. At this stage, only human expert evaluations can provide appropriate ground truth for clinical performance. It’s important to ask the question: Are leading AI developers applying the necessary rigor in this crucial evaluation process?
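
One way to make the circularity concern concrete is to check how closely an automated grader agrees with clinicians on the same responses before trusting it at scale. The sketch below is only an illustration, not a method used by HealthBench or any lab: the labels are hypothetical, and Cohen's kappa is one assumed choice of agreement statistic.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, given each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for eight AI responses (illustrative, not real data).
clinician = ["safe", "safe", "unsafe", "safe", "unsafe", "safe", "unsafe", "safe"]
ai_grader = ["safe", "safe", "safe", "safe", "unsafe", "safe", "safe", "safe"]

kappa = cohens_kappa(clinician, ai_grader)
# Responses the clinician flagged as unsafe but the automated grader passed.
missed_unsafe = sum(c == "unsafe" and g == "safe" for c, g in zip(clinician, ai_grader))
print(f"kappa = {kappa:.2f}, unsafe responses missed by the grader = {missed_unsafe}")
```

In this made-up sample the raw agreement looks respectable at 75 percent, yet the grader passes two of the three responses the clinician marked unsafe, which is exactly the kind of gap that only a human reference standard can expose.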

The final test of any clinical tool lies in its impact on patient outcomes. This requires rigorous clinical trials that track how patients fare when the tool is used in their care—particularly their long-term recovery and health outcomes.

The current approach for clinical AI agents is akin to claiming a new drug is safe based solely on computational models of its molecular interactions (e.g., AlphaFold), without conducting comprehensive clinical trials. Just as drug development demands rigorous human testing to prove real-world safety and efficacy, AI intended for clinical use requires far more than AI-driven simulations.

The steps toward safe deployment of clinical AI agents require substantially improved testing frameworks and are likely to require more time and effort than anticipated by leading AI labs.

To truly safeguard patients and build trust, we must fundamentally elevate AI testing standards through:

  • Real-user interactions: Testing models with genuine clinical presentations from actual users.
  • Expert human evaluation: Having qualified clinicians assess the quality, safety, and appropriateness of AI responses.
  • Impact assessment: Conducting experimental, randomized studies to evaluate the tangible impact of AI interactions on user understanding, decisions, and well-being.
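
As a rough sketch of what the impact-assessment step could look like in analysis terms, the example below randomly assigns participants to two arms and compares an outcome score with a permutation test. Everything here is an assumption for illustration: the function names, the outcome scores, and the choice of a permutation test are hypothetical, and a real trial would follow a pre-registered protocol with clinical and statistical oversight.

```python
import random

def assign_arms(participant_ids, seed):
    """Randomly split participants into two arms (second arm gets the extra if odd)."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean outcome."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    extreme = 0
    for _ in range(n_permutations):
        # Reshuffle outcomes as if arm labels carried no information.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(treated)]) - mean(pooled[len(treated):]))
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical participants and outcome scores (illustrative, not real data).
ai_arm, control_arm = assign_arms(range(16), seed=42)
ai_scores = [10, 11, 12, 13, 14, 15, 16, 17]
control_scores = [0, 1, 2, 3, 4, 5, 6, 7]
p_value = permutation_p_value(ai_scores, control_scores)
```

Random assignment is what allows a difference between arms to be attributed to the AI interaction rather than to who chose to use it.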

Some players in the space do take rigorous testing of clinical AI seriously and are making substantial progress. Government bodies such as the FDA and the U.S. and U.K. AI Safety Institutes are tasked with creating guidelines for testing the safety of AI in clinical applications. For example, the FDA’s recommendations for testing AI for clinical fitness [1] differ starkly from what large leading labs claim to be sufficient. At the same time, the U.S. and U.K. AI Safety Institutes and applied clinical AI companies are working to improve the validity of clinical AI testing by creating more appropriate benchmarks and by studying how large language models actually affect user well-being when used for medical purposes.

AI is still in its infancy, and only by embracing stringent, real-world testing can clinical AI mature responsibly. As with drugs, therapeutics, medical devices, and other clinically impactful interventions, AI needs to be thoroughly tested before it is allowed to stand alone in a clinical setting.

It’s the only path to developing AI models that are genuinely safe, effective, and beneficial for patient care, moving beyond theoretical benchmarks created by technologists to proven clinical utility administered by clinicians.

Max Rollwage is a health care executive.

Tagged as: Health IT

Founded in 2004 by Kevin Pho, MD, KevinMD.com is the web’s leading platform where physicians, advanced practitioners, nurses, medical students, and patients share their insight and tell their stories.
