Effect of Region-of-Interest Prompting on Gemini 2.5 Pro in MRI Classification of Anterior Cruciate Ligament Injury

Source: PubMed "rice"

Cureus. 2026 Feb 2;18(2):e102850. doi: 10.7759/cureus.102850. eCollection 2026 Feb.

ABSTRACT

BACKGROUND: Artificial intelligence (AI) has shown promise in musculoskeletal imaging, yet the diagnostic contribution of large language models (LLMs) remains unclear. Prompt engineering may critically shape performance.

OBJECTIVE: To evaluate the diagnostic accuracy of Google Gemini 2.5 Pro in classifying anterior cruciate ligament (ACL) status on knee magnetic resonance imaging (MRI) and to compare three prompting strategies; the primary endpoint was weighted F1-score.

METHODS: A retrospective diagnostic study used 150 proton-density fat-suppressed (PD-FS) knee MRI volumes (50 each: healthy, partially injured, completely ruptured) drawn from a publicly available dataset (Clinical Hospital Centre Rijeka, Croatia; 2006-2014). Gemini 2.5 Pro received multimodal inputs via the official Python software development kit (SDK). Three prompts were tested: (P1) general series prompt, (P2) technical-description prompt, and (P3) region-of-interest (ROI)-focused prompt. Outputs (A = healthy, B = partial, C = ruptured) were compared with radiologist labels. Accuracy, precision, recall, specificity, F1-score, confusion matrices, and mean inference time were computed (scikit-learn v1.5.0). Ethical approval was waived because the data were de-identified and publicly available.

RESULTS: Mean inference time was 2.1 ± 0.3 seconds per volume. ROI prompting (P3) yielded the highest weighted F1-score (0.31), while macro recall (0.35) and macro specificity (0.67) were similar across prompts. Confusion matrices showed improved discrimination of completely ruptured ACLs with P3.

CONCLUSIONS: Despite a minor improvement in the weighted F1-score with Prompt 3, all prompts demonstrate poor overall classification performance, with low sensitivity and accuracy. The consistently overlapping confidence intervals indicate that prompt variations alone are insufficient to meaningfully enhance model performance. These findings suggest fundamental limitations in the model's ability to handle this task rather than suboptimal prompting.

PMID:41798494 | PMC:PMC12961630 | DOI:10.7759/cureus.102850
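
The summary metrics named in the abstract (weighted F1-score, macro recall, macro specificity) can all be derived from a three-class confusion matrix. The following is a minimal pure-Python sketch of those definitions, not the authors' code: the study reports using scikit-learn, whereas the helper names here (`build_confusion_matrix`, `per_class_metrics`, `summarize`) are hypothetical, and the six toy labels are invented for illustration and are not study data.

```python
from collections import Counter

# Class codes as defined in the abstract: A = healthy, B = partial, C = ruptured.
LABELS = ["A", "B", "C"]


def build_confusion_matrix(y_true, y_pred, labels=LABELS):
    """Return a matrix with true classes as rows and predicted classes as columns."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_metrics(cm):
    """Compute one-vs-rest precision, recall, specificity, F1 and support per class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true class k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true class otherwise
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "f1": f1,
            "support": tp + fn,
        })
    return metrics


def summarize(cm):
    """Aggregate per-class metrics into the summary figures named in the abstract."""
    m = per_class_metrics(cm)
    total = sum(c["support"] for c in m)
    return {
        "accuracy": sum(cm[k][k] for k in range(len(cm))) / total,
        # Weighted F1: per-class F1 averaged with class support as weights.
        "weighted_f1": sum(c["f1"] * c["support"] for c in m) / total,
        # Macro averages: unweighted mean over the classes.
        "macro_recall": sum(c["recall"] for c in m) / len(m),
        "macro_specificity": sum(c["specificity"] for c in m) / len(m),
    }


# Toy example (invented labels, NOT data from the study).
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "C", "C", "C"]

cm = build_confusion_matrix(y_true, y_pred)
summary = summarize(cm)
print(cm)
print(summary)
```

When every class has the same support, as in a design with 50 volumes per class, macro recall computed this way equals overall accuracy, and the weighted F1-score equals the macro F1-score. For three balanced classes, random guessing gives an expected accuracy of about 0.33.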
