| Study | Journal | Task | Omission rate | n (cases) |
|---|---|---|---|---|
| CREOLA (Asgari et al. 2025) | Nature | Clinical note generation | 3.5% | |
| MedR-Bench (Qiu et al. 2025) | Nat. Comm. | Diagnostic reasoning recall | ~30% | 1,453 |
| Stanford (Grolleau et al. 2025) | | Discharge note completeness | ~35% | |
| This benchmark | | Treatment component omission | 60.2% | 493 |

The benchmark draws on 496 treatment case reports from MedR-Bench (PMC Open Access), all published after July 2024 to avoid training-data contamination. For each case, the structured patient summary was submitted to GPT-4o-mini with instructions to generate a detailed treatment plan covering specific medications, dosages, procedures, monitoring, and timing. The AI-generated plan was then compared against the treatment published in the original case report, and each component of the published treatment was independently classified as covered or missed.

Omission rate = missed components / total components × 100%. Of the 496 cases, 493 were successfully evaluated; 3 were excluded due to API errors. Evaluation follows the LLM-as-judge methodology of MedR-Bench (Qiu et al. 2025), with GPT-4o-mini at temperature 0.1 as the judge model.
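
The omission-rate formula can be illustrated with a minimal Python sketch. The `CaseResult` structure and the `omission_rate` function are hypothetical names introduced here, not part of the benchmark code. The sketch also assumes the rate is pooled over all components across cases; the text does not say whether the reported figure is pooled or averaged per case. The counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Judge output for one case (hypothetical structure, for illustration)."""
    case_id: str
    covered: int  # components the judge classified as covered
    missed: int   # components the judge classified as missed


def omission_rate(results: list[CaseResult]) -> float:
    """Pooled omission rate in percent: missed / total components x 100.

    Assumption: components are pooled across cases before dividing.
    """
    missed = sum(r.missed for r in results)
    total = sum(r.missed + r.covered for r in results)
    if total == 0:
        raise ValueError("no treatment components to score")
    return 100.0 * missed / total


# Illustrative counts only, not data from the benchmark.
example = [
    CaseResult("case-a", covered=2, missed=3),
    CaseResult("case-b", covered=4, missed=1),
]
print(f"{omission_rate(example):.1f}%")  # prints 40.0%
```

A per-case mean (averaging each case's own rate) would weight cases with few components more heavily and can give a different number, so the aggregation choice is worth stating alongside the result.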