Validating AI in health care: the role of real-world evidence

Before complex medical interventions ever reach patients, their effectiveness must be demonstrated through rigorous testing in controlled settings such as laboratories and research facilities. But establishing effectiveness doesn’t stop there. Assessing a medical treatment’s usefulness is a multilayered process that goes well beyond the laboratory: It must also examine how the treatment fares in day-to-day clinical practice for a wide variety of patients. Now, as AI takes on a larger role in care delivery, the same need exists. AI’s performance must be evaluated in real-world clinical practice, not just on controlled datasets or in isolated training environments.

Fortunately, although AI is a new addition to the health care toolkit, a roadmap for assessing its effectiveness already exists in the methodologies built around real-world data (RWD) and real-world evidence (RWE). Long used in post-market surveillance and regulatory decision-making for health care interventions, RWE studies are an established practice for the continuous, outcomes-driven assessment of a treatment’s safety, effectiveness, generalizability, and equity. But applying this practice to AI-supported treatments requires data discipline and routine updates. AI has the potential to revolutionize the way we address serious health concerns. To be used safely, however, the technology must be held to the same level of ongoing scrutiny and real-world discipline as any other intervention used in clinical practice.

Adaptive tools need adaptive oversight and validation

Many AI models aren’t static. Their behavior evolves based on shifting data and new information. They change when retrained or updated and learn new patterns through a series of complex and layered steps. The performance of any AI model, including those used in health care, can also decay (i.e., experience “model drift”) as populations, diagnosis coding, or practice patterns shift. Feedback loops can amplify errors, and subgroup gaps may widen quietly over time. Version updates and data pipeline tweaks can also alter behavior without obvious signals. With critical elements of our health care system at stake, detecting these subtle but impactful shifts through real-world validation is essential. This kind of validation can help show that an algorithm works at launch, and it can also help keep the algorithm safe, effective, and equitable as the model evolves and the care environment changes. Getting this validation right is where RWE studies come in.
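One way to make “model drift” concrete is to compare the distribution of a model’s risk scores at validation with the distribution seen in recent practice. The sketch below uses the population stability index (PSI), a common monitoring heuristic that is not named in this article; the function names, the ten equal-width bins, and the small floor value are illustrative assumptions, and it assumes scores lie between 0 and 1.

```python
import math
from typing import List, Sequence


def bin_fractions(scores: Sequence[float], n_bins: int, floor: float) -> List[float]:
    """Share of scores in each of n_bins equal-width bins on [0, 1].

    Each share is floored at a small value so the logarithm below is defined
    even when a bin is empty (an illustrative choice, not a standard).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * n_bins
    for s in scores:
        # Clamp so that scores of exactly 1.0 (or stray values) land in a valid bin.
        idx = min(max(int(s * n_bins), 0), n_bins - 1)
        counts[idx] += 1
    return [max(c / len(scores), floor) for c in counts]


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,
    floor: float = 1e-6,
) -> float:
    """PSI between baseline scores and current scores.

    PSI = sum over bins of (q - p) * ln(q / p), where p and q are the
    baseline and current bin shares. Identical distributions give 0.
    """
    p = bin_fractions(baseline, n_bins, floor)
    q = bin_fractions(current, n_bins, floor)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a shift worth investigating, but that cutoff is a convention rather than a clinical standard. A shift in scores says nothing by itself about patient outcomes, so it is best read as a trigger for closer review.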

RWE’s unique potential in a world of AI-supported health care

RWE is the evidence derived from rigorous analysis of RWD, which can be aggregated from electronic health records (EHRs), claims, registries, medical devices, and patient-reported information. It is used to evaluate the safety and effectiveness of care in everyday practice. Rooted in pharmacoepidemiology and post-market surveillance, RWE draws on observational data from the day-to-day provision of health care to uncover benefits and risks that tightly controlled clinical trials may not capture.

As EHR adoption, common data standards, data linkages across systems, and modern causal methods have matured, regulators have also begun formally using RWE alongside trials, first for safety and labeling, then for coverage, quality, and comparative effectiveness. When combined with clinical trials, RWE adds value by demonstrating how interventions perform across diverse patient populations and settings, at scale and over time.

This approach is a natural fit for health care AI. RWE can test algorithms across populations, workflows, and datasets. Done well, it can link technical metrics like discrimination, calibration, or drift to clinical and quality outcomes such as utilization and error rates, track performance over time, and log meaningful changes. RWE studies work best when they follow a pre-specified analysis plan, offering a transparent and practical way to show when an AI tool is safe, effective, and equitable in real-world care, and to reveal where it isn’t.
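To illustrate what checking discrimination and calibration across populations can look like, here is a minimal Python sketch. It assumes each record holds a hypothetical subgroup label ("group"), a predicted risk between 0 and 1 ("risk"), and an observed binary outcome ("outcome"); these field names and the choice of summary statistics are illustrative, not taken from any particular study or library.

```python
from typing import Dict, List, Optional, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Discrimination: probability that a random patient with the outcome is
    scored higher than a random patient without it (ties count as half).
    Returns None when a subgroup has only one outcome class."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_gap(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Calibration-in-the-large: mean predicted risk minus observed event rate.
    Positive values mean the model overestimates risk on average."""
    return sum(scores) / len(scores) - sum(labels) / len(labels)


def subgroup_report(records: List[dict]) -> Dict[str, dict]:
    """Compute both metrics separately for every subgroup in the records."""
    groups: Dict[str, List[dict]] = {}
    for r in records:
        groups.setdefault(r["group"], []).append(r)
    report = {}
    for name, rows in groups.items():
        labels = [r["outcome"] for r in rows]
        scores = [r["risk"] for r in rows]
        report[name] = {
            "n": len(rows),
            "auc": auc(labels, scores),
            "calibration_gap": calibration_gap(labels, scores),
        }
    return report
```

Reporting each metric per subgroup, rather than only in aggregate, is what lets a widening gap show up. A real RWE study would add confidence intervals and pre-specify which subgroups and thresholds matter before looking at the data.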

Getting RWE studies right, and why it matters

RWE helps answer lingering questions about a health care intervention’s effectiveness outside of controlled trials and helps confirm that it remains effective over time. But any RWE study is only as strong as its foundations: fit-for-purpose data, a clear question, and transparent methods. If the study data are incomplete, inconsistently coded, or poorly linked, results can mislead or amplify risks like residual confounding, selection bias, and misclassification. Credible RWE shows its work by documenting data provenance, employing rigorous design and causal methods, conducting sensitivity analyses, and selecting outcomes that matter clinically.

To use RWE effectively to evaluate AI, the study must be performed in the settings where the AI models are typically used, on the patients and workflows they affect, and with outcomes that matter. Done this way, the evaluation helps ensure that the AI model’s output remains safe, effective, and fair as data, practice, and the models themselves evolve. It is an investment in quality that developers of AI tools must make, and that users must demand of the products they introduce into their practices.

Applying RWE across the AI product lifecycle

The U.S. Food and Drug Administration (FDA) recognizes this need and has incorporated RWE into its total product life cycle (TPLC) framework for medical devices, including those that are AI-enabled. This framework guides medical device manufacturers and regulators through the design, validation, and post-market monitoring of medical technologies and can serve as a guidepost for future development of other forms of health care AI.

Integrating RWE throughout this lifecycle enables continuous oversight at each stage.

During design and development, RWE helps product developers understand user needs, patient diversity, and real-world contexts.
In testing and validation, RWE can confirm that an AI model performs consistently across different populations and care settings through external validation, subgroup analyses, and workflow assessments.
After market release, RWE supports post-market surveillance, detecting issues such as data drift, bias, or performance degradation over time.

RWE serves as both a quality assurance tool and a feedback mechanism, helping ensure that AI systems evolve responsibly and are informed by real-world health care scenarios.

A final word: Building trust in AI through evidence

The future of AI in health care depends on a thoughtful balance of innovation with diligent accountability and evidence. RWE study frameworks offer a proven, pragmatic pathway to assess whether AI truly benefits patients in the settings where care happens every day. But RWE studies themselves must be applied carefully and consistently to provide the most benefit.

Applying established RWE principles to the validation of AI-enabled medical treatments is essential for health care organizations, regulators, and developers to align around a shared goal: ensuring that AI improves health outcomes for all. AI’s potential is extraordinary. Realizing that potential requires the same discipline, transparency, and continuous learning that define the best of medicine itself.

Jeanna Blitz is a physician executive.