Source:
PubMed "essential OR oil extract"
JMIR Med Inform. 2026 May 11;14:e90374. doi: 10.2196/90374.

ABSTRACT

BACKGROUND: Large language models (LLMs) are increasingly used to summarize clinical documents; yet, automated metrics often inadequately capture clinical relevance and safety. In the initial phase of the "Framework and Implementation of AI Tools," an expert-driven, cocreated evaluation methodology was established to assess LLM-generated discharge letter summaries, combining prompt design considerations with intuitive expert appraisal.

OBJECTIVE: This study aimed to quantify expert agreement and interrater reliability on LLM summaries of discharge letters, identify frequent and clinically relevant errors, and evaluate practical implications for integrating LLMs into documentation workflows.

METHODS: Thirty expert-curated synthetic Dutch discharge letters were summarized. Thirty-one clinicians from Flemish care settings (1 university hospital, 2 private hospitals, and 2 general practice circles) evaluated the summaries. The evaluation framework consisted of 61 binary layout items assessing whether required sections and formatting were correctly present, 33 content items (correct or complete vs incorrect, subcategorizing missing, irrelevant, and hallucinated information), a 4-point global quality rating, and an open comment. Statistical analyses included descriptive statistics, mixed effects ordinal regression on the global score, consensus (agreement per question or letter) percentages, interrater reliability (Cohen κ, intraclass correlation coefficient [ICC], Fleiss κ, and prevalence index), and thematic synthesis of comments.

RESULTS: Layout adherence was high (88%), especially in the conclusion section. The positive response rate for content was overall moderate (78%), with the best performance observed in the medical history section and the lowest performance observed in the medication section, which also showed the highest rate of hallucinations and the weakest interrater consensus.
Across all sections, missing information was the most common error. Nearly 70% of global ratings were "good" or "very good." Higher positive response rates for content predicted better global scores (β=.079; P<.001), while layout and participant specialty were not relevant to global scoring. Consensus was high for the layout questions (median 96.8%, IQR 90.2%-100%) and somewhat lower for content (median 83.9%, IQR 67.7%-96.8%), with the lowest agreement in the medication section. Interrater agreement was moderate (median Cohen κ=0.36, IQR 0.29-0.43; range 0.07-0.56), but overall reliability was high (ICC 0.945, 95% CI 0.942-0.948), indicating strong consistency at the global level despite interrater variability. The prevalence index demonstrated that high ICC values were partly driven by the strong prevalence of affirmative responses in layout items, while content items showed more balanced distributions and lower agreement.

CONCLUSIONS: Our framework offers a robust approach for evaluating LLM-generated discharge summaries, balancing usability and clinical relevance. Semantic integrity, especially regarding medication details, was identified as a key vulnerability. Perceived overall quality was driven by the positive response rate for content. High ICC values for the global score, combined with lower item-level agreement, point toward the need for clearer, context-specific prompts and standardized evaluation criteria to reduce interrater variability. Human oversight and targeted automated checks for omissions and hallucinations are essential for safe clinical deployment.

PMID:42114040 | DOI:10.2196/90374
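The interplay the results describe between Cohen κ and the prevalence index (strong prevalence of affirmative responses inflating expected agreement and depressing κ even when raw agreement is high) can be illustrated with a minimal Python sketch. The rating vectors here are invented for illustration and are not study data; the formulas are the standard two-rater definitions, not necessarily the exact implementation the authors used.

```python
# Illustrative sketch: two-rater Cohen's kappa and prevalence index
# for binary (0/1) item ratings. Ratings below are made-up examples.

def cohens_kappa(r1, r2):
    """Standard two-rater Cohen's kappa for categorical ratings."""
    n = len(r1)
    categories = set(r1) | set(r2)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n)        # chance agreement
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

def prevalence_index(r1, r2):
    """|both-positive - both-negative| / n for binary ratings."""
    n = len(r1)
    both_pos = sum(a == b == 1 for a, b in zip(r1, r2))
    both_neg = sum(a == b == 0 for a, b in zip(r1, r2))
    return abs(both_pos - both_neg) / n

# Skewed distribution (like layout items): 80% raw agreement, yet kappa
# is pulled down because affirmative answers dominate both raters.
r1 = [1, 1, 1, 1, 0, 1, 0, 1, 1, 1]
r2 = [1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
print(round(cohens_kappa(r1, r2), 3))      # 0.375
print(round(prevalence_index(r1, r2), 3))  # 0.6 (highly skewed)
```

A high prevalence index (here 0.6) flags exactly the situation the abstract reports for layout items: agreement percentages look excellent while κ stays moderate, so the two statistics should be read together.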