Treatment Omission in AI-Generated Clinical Plans:
A Systematic Benchmark of 493 Cases

MedR-Bench Treatment Dataset (Qiu et al., Nature Communications 2025) · GPT-4o-mini · March 2026

60.2%: average treatment information omitted by GPT-4o-mini across 493 published case reports and 13 body systems
493: cases evaluated
318: cases with more than 50% omission
3.0: critical components missed per case
13: body systems covered
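The headline figures above are simple aggregates over per-case results. A minimal sketch of how they can be derived, using invented records; the field names ("omission_pct", "critical_missed") are our own assumptions, not the benchmark's actual schema:

```python
# Illustrative per-case records; the real benchmark stores one record per
# evaluated case. These values are stand-ins, not real benchmark data.
cases = [
    {"omission_pct": 72.0, "critical_missed": 4},
    {"omission_pct": 40.0, "critical_missed": 1},
    {"omission_pct": 65.0, "critical_missed": 5},
    {"omission_pct": 58.0, "critical_missed": 2},
]

# Mean omission across cases, count of high-omission cases, and mean
# number of critical components missed per case.
avg_omission = sum(c["omission_pct"] for c in cases) / len(cases)
high_omission = sum(1 for c in cases if c["omission_pct"] > 50)
avg_critical = sum(c["critical_missed"] for c in cases) / len(cases)

print(f"average omission: {avg_omission:.1f}%")         # average omission: 58.8%
print(f"cases >50% omission: {high_omission}")          # cases >50% omission: 3
print(f"critical missed per case: {avg_critical:.1f}")  # critical missed per case: 3.0
```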
Figure 1 — Omission rate by body system (%)

Table 1 — Selected Completeness Studies in Medical AI

Study                           | Journal    | Task                         | Omission | n
CREOLA (Asgari et al. 2025)     | Nature     | Clinical note generation     | 3.5%     | —
MedR-Bench (Qiu et al. 2025)    | Nat. Comm. | Diagnostic reasoning recall  | ~30%     | 1,453
Stanford (Grolleau et al. 2025) | —          | Discharge note completeness  | ~35%     | —
This benchmark                  | —          | Treatment component omission | 60.2%    | 493
Studies differ in task, dataset, and evaluation methodology; figures are not directly comparable. To our knowledge, this is the first systematic benchmark measuring treatment component omission in AI-generated clinical plans.

Method

We used 496 treatment case reports from MedR-Bench (PMC Open Access, all published after July 2024, ensuring no training-data contamination). For each case, the structured patient summary was submitted to GPT-4o-mini with instructions to generate a detailed treatment plan, including specific medications, dosages, procedures, monitoring, and timing.

Each AI-generated plan was then compared against the published treatment from the original case report, with every treatment component independently classified as covered or missed. Omission rate = missed components / total components × 100%. 493 of 496 cases were evaluated successfully; 3 were excluded due to API errors.

Evaluation follows the LLM-as-judge methodology of MedR-Bench (Qiu et al. 2025), with GPT-4o-mini at temperature 0.1 as the judge model.
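The per-case scoring step can be sketched as follows. This assumes the judge has already produced a covered/missed verdict for each reference component; the names (`TreatmentComponent`, `omission_rate`) and the sample components are ours for illustration, not from the MedR-Bench codebase.

```python
from dataclasses import dataclass

@dataclass
class TreatmentComponent:
    text: str      # component extracted from the published treatment plan
    covered: bool  # judge verdict: is it present in the AI-generated plan?

def omission_rate(components: list[TreatmentComponent]) -> float:
    """Omission rate = missed components / total components x 100%."""
    if not components:
        return 0.0
    missed = sum(1 for c in components if not c.covered)
    return 100.0 * missed / len(components)

# Hypothetical case: 4 reference components, 2 judged missing.
case = [
    TreatmentComponent("ceftriaxone 2 g IV daily", covered=True),
    TreatmentComponent("renal function monitoring", covered=False),
    TreatmentComponent("surgical drainage", covered=False),
    TreatmentComponent("follow-up imaging at 6 weeks", covered=True),
]
print(f"{omission_rate(case):.1f}% omitted")  # 50.0% omitted
```

The per-case rates are then averaged across the 493 evaluated cases to produce the 60.2% headline figure.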