Artificial intelligence is no longer a distant concept in modern medicine. It is already entering clinical workflows through tools that draft patient notes, summarize charts, generate patient education materials, and assist with decision-making. At the center of this shift are large language models such as ChatGPT, Med-PaLM, and other health care-adapted systems. For clinicians, the most important question is not whether these tools are impressive, but whether they understand what these models actually do, where the models help, and where the models can mislead them.
That question matters because artificial intelligence is arriving at a particularly difficult moment in medicine. Patient demand continues to rise. Workforce shortages persist. Care is becoming more complex. Administrative burden keeps expanding. Burnout is no longer an abstract concern; it is part of the daily reality of clinical practice. In that environment, artificial intelligence is not entering health care as a novelty. It is being introduced as a possible response to a system already under significant strain. The real issue is not whether artificial intelligence will replace physicians. It will not. The issue is whether it can meaningfully reduce some of the pressure that is making modern medicine harder to sustain.
How large language models work
Large language models are often described as intelligent, but that description can be misleading. They are not clinical reasoning engines. They are highly advanced language prediction systems designed to generate the next most likely word or phrase based on the text that comes before it. Large language models are trained on enormous datasets and become very good at recognizing patterns, relationships, and structure in language.
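To make that concrete, here is a minimal, non-clinical sketch using the small open-source GPT-2 model through the Hugging Face transformers library. It asks the model only one thing: given a fragment of text, which tokens are most likely to come next. The prompt and model choice are purely illustrative assumptions, not anything resembling a medical system.

```python
# A minimal, non-clinical sketch of next-token prediction, using the small
# open-source GPT-2 model via the Hugging Face transformers library.
# The prompt is illustrative; nothing here constitutes a medical model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The patient reports chest pain radiating to the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one score per vocabulary token, per position

# Look only at the final position: the model is ranking likely continuations,
# not reasoning about the patient.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {p.item():.3f}")
```

Everything a chatbot produces downstream, from a drafted note to an answered question, is built by repeating this single ranking step one token at a time.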
This distinction is critical. Large language models can produce responses that sound fluent, polished, and clinically plausible. But fluency is not the same as understanding. These systems do not know pathophysiology. They do not think through uncertainty the way a clinician does. They do not build a differential diagnosis from first principles or apply judgment in the human sense. Their strength lies in pattern recognition at a vast scale, not true reasoning.
Large language model development typically happens in stages. First is pretraining on vast amounts of text, which may include medical literature, guidelines, educational resources, and other written material. This gives the model broad familiarity with language and domain-specific terminology. Next comes fine-tuning on narrower health care datasets, such as clinical notes, radiology reports, discharge summaries, or patient educational content. Some systems are further shaped through reinforcement learning from human feedback, in which clinicians or reviewers evaluate outputs and steer the model toward safer or more useful responses.
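As a hedged illustration of the fine-tuning stage only, the sketch below adapts the same small open-source GPT-2 model to a pair of invented note fragments using the Hugging Face Trainer. The dataset, model, and settings are placeholders; real clinical fine-tuning uses large, approved, de-identified corpora and far more safeguards than shown here.

```python
# A hedged sketch of the fine-tuning stage only, using the open-source GPT-2
# model and the Hugging Face Trainer. The two "notes" below are invented
# placeholders; a real effort would use a large, approved, de-identified corpus.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

texts = [
    "Discharge summary: patient tolerated the procedure well ...",
    "Radiology report: no acute cardiopulmonary abnormality ...",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = Dataset.from_dict({"text": texts}).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-demo",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    # mlm=False means standard next-token (causal) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```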
This training process improves performance, but it does not eliminate risk. These models inherit both the strengths and the weaknesses of the data used to train them. If the source material is inconsistent, biased, outdated, or inaccurate, the model will reflect those problems. And because the practice of medicine changes quickly, even a well-performing model can become out of date unless it is regularly updated or paired with tools that retrieve and provide current information.
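The idea of pairing a model with tools that supply current information is usually called retrieval-augmented generation. A minimal sketch of the pattern follows; `search_guideline_index` and `generate_draft` are hypothetical stand-ins for an institution's own vetted document search and approved model endpoint, not real library calls.

```python
# A minimal sketch of retrieval-augmented generation: pairing a model with a
# source of current, vetted information. Both helper functions are
# hypothetical stand-ins, not real library calls.
from typing import List


def search_guideline_index(query: str, top_k: int = 3) -> List[str]:
    """Hypothetical lookup against a curated, regularly updated guideline store."""
    # In practice this would query an institution-approved document index.
    return [f"[placeholder excerpt {i + 1} related to: {query}]" for i in range(top_k)]


def generate_draft(prompt: str) -> str:
    """Hypothetical call to an institution-approved language model endpoint."""
    return "[model draft would appear here]"


def answer_with_current_sources(question: str) -> str:
    # 1. Retrieve vetted, up-to-date text instead of relying on stale training data.
    excerpts = search_guideline_index(question)
    context = "\n\n".join(excerpts)
    # 2. Ask the model to draft an answer grounded only in the retrieved context.
    #    A clinician still reviews the draft before it is used for anything.
    prompt = (
        "Using only the excerpts below, draft a brief summary for clinician review. "
        "If the excerpts do not answer the question, say so.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
    return generate_draft(prompt)


print(answer_with_current_sources("current first-line therapy for condition X"))
```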
Data integrity and limitations in large language model training
Anyone who works in medicine understands how messy clinical documentation can be. Notes are filled with abbreviations, shorthand, copied-forward text, inconsistent formatting, outdated details, and fragmented narratives. Those same characteristics become a problem when that documentation is absorbed into training datasets: the model learns not only medical language but also the errors embedded within it.
The issue extends beyond messy notes. Training data may also include outdated recommendations, incorrect diagnoses, low-quality educational sources, demographically narrow studies, or content that has not been rigorously validated. Some datasets underrepresent important populations, including racial and ethnic minorities, pediatric and geriatric patients, low-income communities, and people with rare diseases. As a result, model performance may not be equally reliable across patient groups. In some cases, these tools risk reinforcing the very inequities medicine is already trying to address.
Even the technical side of data management can introduce problems. Duplicated encounters, coding errors, mislabeled images, and synthetic examples that do not reflect real-world care can all distort the model’s output. When that happens, the result may be an answer that appears coherent but is fundamentally wrong.
Clinical limitations of large language models
This leads to one of the best-known limitations of large language models: hallucination. These systems can generate false information with complete confidence. They may invent citations, fabricate guideline recommendations, misstate pathophysiology, or produce inaccurate clinical summaries that sound entirely reasonable. In a busy clinical environment, this is a serious risk. The smoother the language, the easier it is to miss the mistake.
More fundamentally, large language models do not exercise clinical judgment. They do not perform true Bayesian reasoning. They do not recognize when a patient presentation is evolving away from the expected pattern. They do not cope with uncertainty, challenge assumptions, or change course when new facts are introduced. They cannot synthesize medical facts with ethics, family dynamics, psychosocial context, and bedside nuance the way experienced clinicians do every day.
Large language models also lack real-time awareness unless they are tightly integrated with the clinical environment. Without access to current vital signs, laboratory results, imaging, medication changes, nursing observations, and social context, they are generating language based on incomplete information. Even when they are integrated into health systems, large language models may still misinterpret conflicting or partial data.
Another major limitation is explainability. Large language models can produce an answer, but they cannot provide a transparent, auditable account of how they arrived at it in a way that satisfies the standards of peer review, legal scrutiny, or formal clinical justification. In medicine, where documentation and accountability matter, that is a serious shortcoming.

The regulatory environment only adds to the uncertainty. The legal and policy framework for clinical artificial intelligence is still evolving. Questions remain about liability, documentation of artificial intelligence-assisted decisions, privacy protections, Food and Drug Administration oversight, and the degree to which clinicians can safely rely on machine-generated recommendations. At present, the clearest principle is that clinician oversight must remain central. Artificial intelligence may assist, but it cannot be the final decision-maker.
Where humans remain central
Medicine is not just an exercise in information retrieval. Clinical care requires interpretation, prioritization, communication, and accountability. Physicians and other clinicians integrate subtle findings across multiple domains, weigh competing possibilities, interpret uncertainty, and adapt decisions to the patient in front of them. They account for psychosocial realities, family concerns, values, culture, and goals of care. They make judgments that go beyond the words in a chart.
These strengths remain deeply human. Medicine is relational. Trust, empathy, shared decision-making, and therapeutic presence are not secondary to care; they are part of care itself. A patient’s willingness to disclose symptoms, follow recommendations, or navigate a serious illness often depends on the quality of that relationship. No language model can replace that. The growing presence of technology in medicine makes the human side of care even more important.
The emerging clinical value of large language models
At the same time, dismissing artificial intelligence and large language models would be a mistake. While large language models are primarily language tools, the broader ecosystem of clinical artificial intelligence is already demonstrating meaningful value in practice, especially in areas marked by information overload and time pressure. In radiology, emergency medicine, and hospital care, artificial intelligence systems are increasingly used to flag acute findings such as stroke, pulmonary embolism, and intracranial hemorrhage, which helps clinicians prioritize critical cases more quickly. In pathology and oncology, artificial intelligence can identify subtle patterns in tissue and imaging data, improving consistency and supporting faster workflows. These systems do not replace expertise. They help clinicians apply expertise more efficiently.
Artificial intelligence is also beginning to reshape the front end of care. Symptom assessment tools and virtual triage platforms may help patients navigate the system more effectively, potentially reducing unnecessary visits and improving alignment between patient needs and the care setting they choose. Predictive analytics are also creating opportunities for earlier identification of risk and more personalized treatment strategies by integrating clinical, genomic, and longitudinal data. Taken together, these developments suggest that artificial intelligence’s greatest contribution may not be replacing decision-making but improving how quickly and effectively clinicians can process complex information and act on it.
For now, the safest and most appropriate role for large language models in clinical practice is in structured, lower-risk support tasks. They can be useful for drafting histories and physicals, discharge summaries, referral letters, and patient instructions. They can help summarize long records, organize information, support coding when verified, and assist with literature synthesis and educational material. What they should not do is function autonomously in high-stakes clinical decisions. They should not independently diagnose new conditions, recommend treatment changes, interpret imaging or electrocardiograms without oversight, make triage decisions, or manage unstable patients. These are not just technical tasks. They require human judgment.
Safe use of these tools requires discipline. Clinicians should review and edit every output. They should use only institution-approved, privacy-compliant systems. Artificial intelligence-generated language should be treated as a draft rather than a conclusion. Most importantly, clinicians should document and rely on their own reasoning rather than simply inheriting the model’s phrasing. Large language models do offer real value in medicine, particularly for documentation, summarization, and information synthesis. But they are not substitutes for clinical reasoning. They do not understand disease, do not think causally, and do not carry responsibility.
Still, their arrival is not happening in a vacuum. They are entering a health care system weighed down by administrative overload, cognitive burden, and workforce strain. If implemented thoughtfully, artificial intelligence may help reduce some of that pressure. It may improve efficiency, support faster recognition of important information, and return time to clinicians. All of this matters, because the most valuable elements of medical care are the principles that technology cannot replace: thinking carefully, connecting with patients, exercising judgment, and delivering humane, high-quality care. The future of artificial intelligence in medicine should not be framed as physicians versus technology. It should be framed as physicians being supported by technology that can genuinely help. At its best, artificial intelligence does not diminish the clinician’s role. It supports and protects it.
Edward G. Rogoff is a professor of entrepreneurship and a patient advocate. Alena Ivashenka is a biotechnology and life sciences investment expert.