Medical statistics errors: How bad data hurts clinicians

Gerald Kuo
Conditions
November 27, 2025

Last winter, a senior nurse in our psychiatric unit told me, “The dashboard says we’re low-risk. But during night shifts, I don’t even feel safe walking to the bathroom.”

The monthly quality report on her desk said the same thing it had said for nearly a year: “Violence incidents: no significant difference among the three wards (p > .05).”

On paper, her ward looked normal. At the bedside, it was anything but.

Her unit cared for more high-acuity patients, had much higher turnover, and used restraints more frequently. Staff were not the problem. Patients were not the problem. The statistics were.

The mistake: treating event counts as if they were average scores

The reassuring report was based on a very common statistical error. The analyst used ANOVA, a method designed to compare averages, to compare counts of violent incidents.

In hospitals, there are two very different kinds of numbers:

  • Counts: How many times something happened (20 violent incidents, 7 falls, 6 code blues).
  • Means: How large something is on average (average documentation hours, average pain scores, average blood pressure).

Counts answer “how many.” Means answer “how much.” They are not interchangeable.

In our hospital, the three wards reported:

  • Ward A (psychiatric): 20 violence incidents
  • Ward B (medical): 7 incidents
  • Ward C (surgical): 6 incidents

To any clinician, the difference is obvious. But ANOVA does not see “20 vs. 7 vs. 6” the way we see it. It transforms them into averages per patient. If each ward cared for about 100 patients, the numbers become:

  • 0.20 incidents per patient
  • 0.07 incidents per patient
  • 0.06 incidents per patient

Once converted, the dramatic difference collapses into three small decimals. Because the events are rare and because ANOVA was designed for continuous measurements rather than yes-or-no outcomes, it can easily conclude that the difference might be random. The official report then states: no significant difference.

It is like using a ruler to decide how many cats you have. The wrong tool makes very different groups appear the same. A chi-square test, which is designed for categorical counts, would almost certainly have flagged Ward A as truly higher risk.

But using the wrong method produced the wrong message: All wards are the same.
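
To make the contrast concrete, here is a minimal sketch in Python of the test the article calls for, using SciPy's chi2_contingency. The roughly 100-patients-per-ward denominators come from the hypothetical conversion above, and treating each incident as a distinct patient is a further simplification for illustration; none of these figures are the hospital's actual census.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical denominators: about 100 patients per ward (illustrative only).
    # Rows: wards A, B, C. Columns: patients with vs. without a violence incident.
    table = np.array([
        [20, 80],   # Ward A (psychiatric)
        [7, 93],    # Ward B (medical)
        [6, 94],    # Ward C (surgical)
    ])

    # Chi-square test of independence: the tool built for categorical counts.
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
    # With these assumed denominators, chi-square is roughly 12.5 and p is
    # roughly .002, flagging Ward A as genuinely different rather than
    # "not significantly" so.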

The human consequences of “no significant difference”

Once the report was distributed, the consequences were immediate and painful.

  • Requests for additional staff from the psychiatric unit were denied. Leadership believed the ward’s risk was not statistically higher.
  • Concerns from frontline nurses were reframed as emotional rather than evidence-based.
  • Administrators felt confident in the p-value, thinking they were being fair.

Meanwhile, the gap between data and reality grew wider.

Nurses learned a frustrating lesson: The numbers on the slide deck do not describe the world they work in. Some left. Those who stayed carried the workload and the emotional weight.

Then the AI system arrived, trained on the same flawed numbers

Three months later, the hospital introduced an AI tool to predict agitation and violence. The idea was simple: train the model on past incidents, then flag high-risk patients.

But the AI learned from the same statistical misunderstanding that claimed all three wards had the same risk. To the algorithm, every ward looked similar.

The psychiatric ward soon became flooded with alerts. Medium-risk patients were labeled high-risk, while genuinely unstable patients were occasionally missed. A junior nurse told me, “When everyone is high-risk, no one is high-risk.”

Alert fatigue set in. A tool designed to increase safety was now undermining trust.

When AI overrules clinical instincts

One busy evening, our 62-year-old attending physician checked the AI overlay on a newly admitted patient. The display showed a calm green label: low risk of agitation.

The charge nurse disagreed. She noticed the patient’s pacing, facial tension, and escalating voice. “I have a bad feeling about this,” she said.

Pressed for time and seeing the AI’s confident label, the attending sided with the model. Ten minutes later, the patient punched a resident in the face.

Afterward, the attending said quietly, “Maybe I’m getting old. Maybe the AI sees things I don’t.”

But the AI was not seeing more. It was repeating the wrong statistics it had been trained on. The harm was not only the physical injury. It was the self-doubt planted in a clinician with decades of experience.

A second problem: stopping at ANOVA and skipping post-hoc tests

Another mistake came from a different type of analysis.

When the hospital compared average documentation time across three departments, ANOVA was correctly used. The p-value was less than 0.01, indicating that at least one department truly differed. But the analysis stopped there. No one asked the next question: Exactly which departments differ from one another?

Post-hoc tests, such as Tukey’s test, answer that question (a brief sketch follows this list). They can reveal findings such as:

  • Department Z documents significantly more than Departments X and Y.
  • Departments X and Y are not significantly different from each other.
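
Here is a minimal sketch of that two-step analysis in Python. Because the article does not report the raw documentation data, the department labels, sample sizes, and times below are invented for illustration.

    import numpy as np
    from scipy.stats import f_oneway, tukey_hsd

    rng = np.random.default_rng(0)

    # Invented documentation times (minutes per shift), 30 clinicians per department.
    # Departments X and Y are drawn from the same distribution; Z is drawn higher.
    dept_x = rng.normal(loc=95, scale=15, size=30)
    dept_y = rng.normal(loc=95, scale=15, size=30)
    dept_z = rng.normal(loc=120, scale=15, size=30)

    # Step 1: the overall ANOVA only says that at least one department differs.
    f_stat, p_overall = f_oneway(dept_x, dept_y, dept_z)
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_overall:.4f}")

    # Step 2: Tukey's HSD (scipy.stats.tukey_hsd, SciPy 1.8+) compares each pair.
    # With this simulated data it should show Z differing from X and from Y,
    # while X vs. Y is not significant: a targeted finding, not a blanket one.
    print(tukey_hsd(dept_x, dept_y, dept_z))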

Without that step, leadership responded with a blanket policy: “Everyone must reduce documentation time by 20 minutes.”

The department drowning in paperwork received no targeted help. The other two were forced to cut time they did not have, just to meet a number.

When results like this feed AI models that attempt to identify “inefficient” units, the algorithm quietly learns the same vague message: Everyone is part of the problem.

How these statistical choices affect clinicians

These mistakes do not stay inside spreadsheets. They show up as:

  • False reassurance
  • False alarms
  • Automation bias
  • Erosion of clinical judgment
  • Loss of trust in data and AI
  • Frontline fatigue

This is how bad statistics hurt good clinicians.

The solution is basic, not high-tech

Protecting clinicians in the age of AI starts long before the algorithm. It begins with the data.

  • Use chi-square for event counts.
  • Use ANOVA for averages.
  • Follow ANOVA with post-hoc tests when appropriate.
  • Pair p-values with simple counts and percentages.
  • Recognize that “not significant” does not always mean “no difference.”
  • Teach clinicians just enough statistics to ask, “What exactly are we comparing?”
  • Make sure AI systems learn from correctly analyzed data.

This is not about turning clinicians into statisticians. It is about giving them trustworthy numbers.

AI does not erode clinical judgment; bad data does

When our statistics are wrong, our AI will be wrong. When AI is wrong, clinicians doubt themselves.

AI did not tell the psychiatric nurse her ward was safe. The misused ANOVA did. AI did not weaken the attending’s instincts. A long chain of statistical shortcuts did.

Protecting clinical judgment in the age of AI does not start with the algorithm. It starts with the numbers we feed into it, and with listening to the clinicians who knew something was wrong long before the p-value did.

Gerald Kuo, a doctoral student in the Graduate Institute of Business Administration at Fu Jen Catholic University in Taiwan, specializes in health care management, long-term care systems, AI governance in clinical and social care settings, and elder care policy. He is affiliated with the Home Health Care Charity Association and maintains a professional presence on Facebook, where he shares updates on research and community work. Kuo helps operate a day-care center for older adults, working closely with families, nurses, and community physicians. His research and practical efforts focus on reducing administrative strain on clinicians, strengthening continuity and quality of elder care, and developing sustainable service models through data, technology, and cross-disciplinary collaboration. He is particularly interested in how emerging AI tools can support aging clinical workforces, enhance care delivery, and build greater trust between health systems and the public.
