Why AI in health care needs stronger testing before clinical use [PODCAST]

The Podcast by KevinMD
Podcast
September 12, 2025

Subscribe to The Podcast by KevinMD. Watch on YouTube. Catch up on old episodes!

Health care executive Max Rollwage discusses his article “The promise and perils of AI in health care: Why we need better testing standards.” Max highlights the risks of relying on AI systems that are evaluated through artificial scenarios or even other AI models, emphasizing the dangers of circular validation in high-stakes environments like emergency care. He discusses why true readiness requires real-world clinical trials, expert human evaluation, and impact studies on patient outcomes. Max also points to the role of regulators like the FDA and AI Safety Institutes in setting higher standards, and he offers practical insights on how the medical community and technology developers can ensure safe and effective AI deployment. Listeners will come away with a clearer understanding of both the promise and the perils of AI in medicine, and what it will take to responsibly bring these tools to the bedside.

Our presenting sponsor is Microsoft Dragon Copilot.

Want to streamline your clinical documentation and take advantage of customizations that put you in control? What about the ability to surface information right at the point of care or automate tasks with just a click? Now, you can.

Microsoft Dragon Copilot, your AI assistant for clinical workflow, is transforming how clinicians work. Offering an extensible AI workspace and a single, integrated platform, Dragon Copilot can help you unlock new levels of efficiency. Plus, it’s backed by a proven track record and decades of clinical expertise, and it’s part of Microsoft Cloud for Healthcare, built on a foundation of trust.

Ease your administrative burdens and stay focused on what matters most with Dragon Copilot, your AI assistant for clinical workflow.

VISIT SPONSOR → https://aka.ms/kevinmd

SUBSCRIBE TO THE PODCAST → https://www.kevinmd.com/podcast

RECOMMENDED BY KEVINMD → https://www.kevinmd.com/recommended

Transcript

Kevin Pho: Hi, and welcome to the show. Subscribe at KevinMD.com/podcast. Today we welcome Max Rollwage. He is a health care executive. Today’s KevinMD article is “The promise and perils of AI in health care: Why we need better testing standards.” Max, welcome to the show.

Max Rollwage: Hi, Kevin. Thank you so much for having me. I am very excited to speak about this important topic.

Kevin Pho: All right, so let us start by briefly sharing your story and journey.


Max Rollwage: Yeah, so my background is in clinical psychology, and maybe it is helpful to say I am currently building and testing clinical AI at a mental health company. We use AI to support patients and clinicians to make mental health care more accessible and more effective. So, day in and day out, I think about how we can use AI in a safe and effective way to support and amplify clinicians in a health care setting.

That thinking obviously means that you look around at the industry and at the research that is coming out. One thing that I found very interesting, or surprising to say the least, is that when you look especially at the big labs that bring out the frontier models, there seems to be a disconnect. They are doing great work pushing the capabilities of these models, and I am really enthusiastic about that because they are doing a fantastic job there. But it all falls down a bit when it comes to testing these models, especially for very sensitive applications like health care. And that obviously carries massive risks. That is the reason why I wrote this article: to explore how good the testing of these models is to date and where we as an industry might need to improve in order to make sure that we push innovation forward in a responsible and safe way.

Kevin Pho: All right, so let us talk about that. You, of course, talk about that in your KevinMD article, “The promise and perils of AI in health care.” For those who did not get a chance to read your article, tell us more.

Max Rollwage: Yeah, so the article was triggered by a bunch of research and, I guess, general media attention for what seemed like breakthrough approaches to clinical AI in health care. Specifically, I wrote the article when the OpenAI HealthBench came out, because there was a lot of noise around it, basically along the lines of the feeling that any moment now we would have AI doctors treating patients everywhere in the real world.

Kevin Pho: And then for those who are not familiar with HealthBench, tell us exactly what that is. So get people up to speed.

Max Rollwage: Yeah, so ultimately a lot of these leading AI companies have done research where they take some form of test in which they simulate patients and show that AI can perform at the level of a human clinician or even beyond. The problem that I discussed in this article is really that these tests are very artificial. Take HealthBench as an example. It is essentially five thousand use cases across all forms of medicine. Clinicians have drafted simulated patient interactions, and then they see how a generative AI model responds to each interaction. Other AI models then grade the performance on these five thousand cases.

The problem that I see here is, on the one hand, that five thousand cases is a good starting point, but if you think about the whole variety of medical fields and the real-world complexity of patients, it is only a drop in the ocean. It really does not capture all the ways in which patients might interact with AI. The other bit that I think is quite important is that these approaches are a bit circular. If you have AI evaluators evaluating the performance of an AI doctor, you get into a circular conclusion: we are trying to establish whether AI is good enough and safe enough for health care, but we are relying on AI to give us the answer. A few things made me think, "I cannot understand why this is the approach that leading companies are taking." It has proven in the past to be good for improving their models; that is the engineering approach they took. But I do think there is a big difference between that and the safety approach, the clinical testing that you need to do in order to bring something like this to the real world.
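To make the circularity concrete, here is a minimal sketch of the AI-grading-AI benchmark loop described above, written in Python. The `call_llm` helper, the case fields, and the rubric format are hypothetical stand-ins for illustration, not HealthBench's actual implementation or any vendor's API.

```python
# Minimal sketch of the "AI grading AI" benchmark pattern described above.
# call_llm is a hypothetical stand-in for any chat-completion client;
# the structure of the loop, not a specific vendor SDK, is the point.

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical LLM call; swap in a real API client to run this."""
    raise NotImplementedError

def grade_case(case: dict, candidate: str, grader: str) -> float:
    # 1. The candidate model answers a clinician-drafted simulated patient prompt.
    answer = call_llm(candidate, case["patient_prompt"])
    # 2. A second AI model grades that answer against rubric criteria.
    rubric = "\n".join(f"- {c}" for c in case["rubric_criteria"])
    verdict = call_llm(
        grader,
        "Score this clinical response from 0 to 1 against the rubric.\n"
        f"Rubric:\n{rubric}\n\nResponse:\n{answer}\n\nReply with a number only.",
    )
    return float(verdict)

def run_benchmark(cases: list[dict], candidate: str, grader: str) -> float:
    # The headline "benchmark score" is just the mean of AI-assigned grades:
    # no human clinician and no patient outcome enters the loop, which is
    # the circular validation the article warns about.
    return sum(grade_case(c, candidate, grader) for c in cases) / len(cases)
```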

Kevin Pho: And I am seeing that a lot of these new iterations of AI models are specifically geared to pass these benchmarks, whether it is HealthBench or a variety of others. Whenever the latest and greatest AI model comes out, they always say it is the best on these benchmarks. It leads to the suspicion that these new models are just being trained to meet benchmarks, teaching to the test, so to speak.

Max Rollwage: Exactly. I think that is exactly why these companies are doing that: so that they can train models on a ground truth. But there is a real question, and I am very critical of this, of whether that translates to the real world. As an example, speaking from my own experience in building AI for patient-facing care, one of the things you learn is that patients behave in very different ways than you would expect. If you only have these five thousand simulated cases, I guarantee you a patient will surprise you by interacting with these AI agents in a way that you did not anticipate. There might be failure modes that your limited benchmark test cases simply did not cover.

The other thing, as I mentioned, is that at this point, where we are still trying to establish whether AI is good enough for clinical practice, human clinicians can be the only ground truth for evaluating how well these AI models are doing. Then the final thing, which I think is really critical and which I see far too little of, is that checking whether the AI gives a good response against objective criteria is one thing, but what we really care about is the impact on the patient. That is only possible to establish through real-world clinical studies: having users interact with these models in a controlled setting and seeing whether it is effective in the long run, for instance through RCTs on clinical outcomes, and whether it is safe. Again, there is this disconnect: benchmarks are great for the very engineering-focused way of building these models, but we need to bridge into a clinical validation step to make sure that these models really are safe and effective in the real world.

Kevin Pho: And when you say that these AI models are being used in clinical settings, how are they being used? Are they being used for decision support? Are they being used as patient-facing chatbots? Which clinical uses concern you most, given the way these models are being tested?

Max Rollwage: In all honesty, I think the biggest concern with these big labs is that a lot of the models are used by people as a substitute for a therapist. Unfortunately, we recently had a lot of media attention on all the ways that ChatGPT users, for instance, use these models for wellbeing support, which often has quite detrimental effects. Similarly, you also see a lot of companies that try to use these models with a wrapper around them for wellbeing and therapy support, things like this.

Do not get me wrong; I work for a mental health company, and I think there is tremendous value in using these models. When properly combined with the right safeguards and the right architecture, they really do get to a point where they can amplify the health care system and genuinely help patients. I am a big advocate for that, but it really needs a lot of thought on how you use it. As an example, in our company, we always use it in combination with a clinician. That might be, for example, for intake and referral, which then provides information to the clinician at assessment, or for therapy-adjacent support in combination with the human therapist.

Just to be clear again, I do think that in order to harness the power of AI, all of these applications are ones we should explore in the future. It is just important to hold to a high bar of rigor and quality, which means really testing the safety of all the features that you are building and then running these clinical studies. Similar to drug development, testing the compound you have developed is only the very first step. I think that is roughly where we are at the moment with these benchmarks: we have done the very first step and shown in principle that these models can do something interesting, but now you need to go through phase one, phase two, and phase three trials to see whether releasing them in the real world is feasible, effective, and safe.

Kevin Pho: You mentioned using the rigor of the FDA and drug clinical trials before we can use AI for clinical applications. Do you ever worry that it is going to be too expensive or take too long to measure and that is going to slow down the innovation of these AI models that are iterating so quickly? Where do you find a balance there?

Max Rollwage: It is a great question. In all honesty, I think about this every day. As an example, I went through regulating a mental health chatbot as a medical device, which at that point, and I believe still today, was the only mental health chatbot to be a medical device. I know what effort it takes, and I completely agree it is a lot. To do that, we produced ten thousand pages of safety documentation on the behavior of these models and ran three large-scale, real-world studies with hundreds of thousands of patients. So it is effortful. I agree.

However, on the flip side, consider the alternative: basically flying blind, just saying, "OK, let us deploy these models and see what happens." Unfortunately, as we see, there is, especially in the mental health space, a large latent demand for these AI products. Right now that demand is often met in an unregulated way, because people simply go to ChatGPT, as an example, and use it as a counselor. The consequences, I think, are unfortunately piling up right now, as we see in the media. I personally do not think that alternative is more appealing for society or for the individuals using these products.

Kevin Pho: So that scenario that you described, flying blind and just releasing these products and then seeing what happens, that is pretty much the way Silicon Valley thinks. That is the way these technology startups work; they just release things and see what happens, and then they just iterate to the next model. How do you get them to take notice of some of the clinical consequences of that approach? Do you see companies like OpenAI and the various startups slowing down because of the consequences? Because that is not part of their DNA.

Max Rollwage: In all honesty, I cannot really comment on OpenAI. I am not certain what their policy is. I can definitely see companies in the more applied clinical space doing the right thing. It is more in their DNA, since they often have more clinicians and researchers on staff, and I think that is critical for bringing evidence-based thinking into building products. That is something I personally find very important in this form of innovation.

The other aspect, and it is often seen as a bad word, is regulation and oversight, something I am personally quite passionate about. I think this is a critical moment in history, where collaboration between institutions like the FDA and non-regulatory bodies like the AI Safety Institutes, which have been established in the U.S. and in the U.K., can play a role. They can work together with industry and agree on ways of testing safety that are not as burdensome as they are often made out to be. That kind of testing gets a bad rap, but it really is the rational way to ensure that you are doing the right thing. Bringing these policymakers and the industry together is something that I would love to see, because I think that is how we can harness the potential without stunting innovation.

Kevin Pho: What kind of regulatory initiatives, to your knowledge, are going on, whether in the U.S. or the U.K.? Is the government getting involved in these potential AI safety initiatives?

Max Rollwage: Again, I can only speak to the parts that I see in my world. These regulators definitely have a lot of interest in working together with industry. We collaborate, as an example, with these safety institutes as well as with the regulators to figure out the best path forward to strike exactly that balance. Personally, from my experience, I do think they are doing a very good job of being open and trying to get that dialogue going. I very strongly believe that is the way forward.

Kevin Pho: We are talking to Max Rollwage. He is a health care executive. Today’s KevinMD article is “The promise and perils of AI in health care: Why we need better testing standards.” Max, let us end with some take-home messages that you want to leave with the KevinMD audience.

Max Rollwage: The take-home I would like to deliver is that we are at a time when AI has fantastic capabilities. I truly believe there will be great advantages in health care delivered by AI, but currently our approach to testing whether AI is safe and effective lags well behind the capabilities that we see. The one thing I would like to avoid is clinicians thinking that AI is unsafe or scary. That is not what I am saying. What I am saying is that there are ways of implementing AI safely; the industry in general needs to adopt these high standards to make sure that the products we are using are tested and evidence-based.

Kevin Pho: Max, thank you so much for sharing your perspective and insight. Thanks again for coming on the show.

Max Rollwage: Thank you very much for having me.
