When a recent study in Nature Medicine compared general-purpose AI models with specialized clinical tools, the headline was exactly the kind the technology world likes to see: The generalists came out on top.
The reactions were predictable. Some observers treated the study as proof that specialized clinical software is no longer needed, arguing that bigger and more powerful AI models will always win. Others dismissed the study, pointing out that AI changes so quickly that any published comparison is already out of date by the time it appears. But I believe both reactions miss the deeper issue. The problem is that many current AI evaluations ignore a central part of clinical medicine: knowing when not to answer.
In medicine, staying silent does not always mean you do not know. Pausing before answering is not a weakness. Sometimes, the most responsible thing a clinician can do is hold back. Every practicing physician understands this instinctively. We are trained not only to gather medical facts but also to recognize the limits of the information before us. We look for what is missing.
That is why one of the most interesting measures in the Nature Medicine study was not simply accuracy, but refusal. UpToDate Expert AI reportedly declined to answer 19 percent of real clinical questions, far more often than the general-purpose models. In a standard AI benchmark, this looks like failure. The model did not satisfy the prompt. It left a blank. It seemed less useful. But in real clinical care, saying, “I cannot answer that safely from the information provided,” may be exactly what a patient needs.
In software, refusal is often treated as a problem to solve. In medicine, refusal can be a safety mechanism. A chatbot that always produces a confident answer may appear more intelligent than a clinical tool that pauses, narrows the question, or declines to respond. But patients are not protected by smooth language.
This is where current AI evaluations fall short. They reward answers that sound smooth, complete, and confident. They ask whether the AI answered the question, but they too rarely ask whether the question was safe to answer in the first place. On a spreadsheet, a formatting mistake and a dangerous pediatric drug dose may both appear as “errors.” In real life, they are not remotely equivalent. A minor mistake in a low-stakes explanation of disease physiology is one thing. Inventing a contraindication, overlooking pregnancy, or recommending a medication without adjusting for severe renal impairment can be catastrophic. If our evaluation systems score all errors as if they carry the same clinical consequence, we will incentivize developers to build the wrong tools.
This is not a criticism of general-purpose AI models. Like many physicians, I use AI regularly to help with documentation, summarize long notes, and organize messy clinical narratives. For administrative and cognitive support, these tools can be extremely useful. But clinical decision support is different. It is not just about finding the right fact or producing the most polished explanation. It is about managing risk when information is incomplete.
That is where I think the board-exam mindset of AI benchmarking breaks down. Standardized medical questions are tidy. They usually include the necessary facts and are designed to lead toward a correct answer. Real patients are rarely so easy. A model that performs well when all relevant facts are already present may not be the safest choice when the most important fact is missing.
Sometimes an AI tool should summarize. Sometimes it should retrieve a guideline. Sometimes it should identify a missing variable, such as creatinine clearance, gestational age, QT interval, culture data, or medication history. And sometimes it should say: This situation is too risky for an automated answer, and specialist input or urgent clinical evaluation is needed. That may frustrate a benchmark. But it often mirrors safe clinical practice.
Medical AI evaluations need to become abstention-aware. Refusal should not automatically be scored as failure. It should be examined. Was the prompt underspecified? Was the clinical risk high? Did the tool identify the missing information? Did it redirect safely? Did it protect the patient from a plausible but unsafe answer?
We also need consequence-weighted scoring. A harmless omission should not be treated the same as a potentially lethal hallucination. A model that refuses appropriately in high-risk situations may be safer than one that answers every question with confidence. If the market keeps rewarding models that never pause, developers will build systems optimized for relentless answer production. Our clinics will end up with tools that sound decisive precisely when medicine requires humility.
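To make abstention-aware, consequence-weighted scoring concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything drawn from the Nature Medicine study or an existing benchmark: the 0-to-3 severity scale, the penalty values, the `abstention_warranted` flag (standing in for a clinician reviewer's judgment that the prompt was underspecified or high-risk), and all of the names. It shows only how a tool that declines appropriately can outscore one that answers everything, even when its raw accuracy is lower.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    ERROR = "error"
    ABSTAINED = "abstained"


@dataclass
class GradedResponse:
    outcome: Outcome
    severity: int = 0                   # hypothetical scale: 0 = harmless, 3 = potentially lethal
    abstention_warranted: bool = False  # reviewer judgment: prompt was underspecified or high-risk


# Hypothetical penalties: a lethal hallucination costs far more than a formatting slip.
ERROR_PENALTY = {0: 0.1, 1: 1.0, 2: 5.0, 3: 25.0}
UNWARRANTED_ABSTENTION_PENALTY = 0.5


def score(response: GradedResponse) -> float:
    """Score one graded response, weighting errors by clinical consequence."""
    if response.outcome is Outcome.CORRECT:
        return 1.0
    if response.outcome is Outcome.ABSTAINED:
        # A refusal is examined, not automatically counted as failure.
        if response.abstention_warranted:
            return 1.0
        return -UNWARRANTED_ABSTENTION_PENALTY
    return -ERROR_PENALTY[response.severity]


def total_score(responses: list[GradedResponse]) -> float:
    return sum(score(r) for r in responses)


def accuracy(responses: list[GradedResponse]) -> float:
    """Conventional metric: fraction of prompts answered correctly."""
    correct = sum(1 for r in responses if r.outcome is Outcome.CORRECT)
    return correct / len(responses)


# A tool that always answers: 8 correct, 1 harmless error, 1 potentially lethal error.
always_answers = (
    [GradedResponse(Outcome.CORRECT)] * 8
    + [GradedResponse(Outcome.ERROR, severity=0)]
    + [GradedResponse(Outcome.ERROR, severity=3)]
)

# A tool that pauses: 7 correct, 2 warranted refusals, 1 unwarranted refusal.
pauses_when_unsafe = (
    [GradedResponse(Outcome.CORRECT)] * 7
    + [GradedResponse(Outcome.ABSTAINED, abstention_warranted=True)] * 2
    + [GradedResponse(Outcome.ABSTAINED, abstention_warranted=False)]
)

for name, graded in [("always answers", always_answers),
                     ("pauses when unsafe", pauses_when_unsafe)]:
    print(f"{name}: accuracy={accuracy(graded):.2f}, weighted={total_score(graded):.2f}")
```

On plain accuracy, the tool that always answers looks better. Under the weighted score, its single high-severity error outweighs everything else it got right. The specific numbers matter less than the design choice: someone with clinical judgment has to decide how severe each error is and whether each refusal was warranted, and those decisions belong in the evaluation itself.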
Restraint is part of good clinical care. Not prescribing, not reassuring, not discharging, and not answering prematurely are often among the most important decisions a physician makes. The future of medical AI should be shaped by whether our tools learn the same lesson we teach every medical trainee: Answer when you can, ask when you must, and stop when answering would be unsafe.
The safest medical AI may not be the one that always has an answer. It may be the one that knows when not to.
Timothy Lesaca is a psychiatrist in private practice at New Directions Mental Health in Pittsburgh, Pennsylvania, with more than forty years of experience treating children, adolescents, and adults across outpatient, inpatient, and community mental health settings. He has published in peer-reviewed and professional venues including the Patient Experience Journal, Psychiatric Times, the Allegheny County Medical Society Bulletin, and other clinical journals, with work addressing topics such as open-access scheduling, Landau-Kleffner syndrome, physician suicide, and the dynamics of contemporary medical practice. His recent writing examines issues of identity, ethical complexity, and patient–clinician relationships in modern health care. Additional information about his clinical practice and professional work is available on his website, timothylesacamd.com. His professional profile also appears on ResearchGate, where further publications and details may be found.