Evaluating Generative AI Large Language Models for Urticaria Management: A Comparative Analysis of DeepSeek-R1 and ChatGPT-4o

Source: PubMed "hive"

Clin Transl Allergy. 2025 Nov;15(11):e70113. doi: 10.1002/clt2.70113.

ABSTRACT

INTRODUCTION: Urticaria is a prevalent condition affecting a significant portion of the global population. Both dermatologists and patients require access to up-to-date and accurate information. Traditional search engines often fall short in meeting these needs. Despite the growing reliance on AI for medical inquiries, the accuracy and quality of AI-generated responses remain understudied. This study aims to evaluate and compare the performance of two widely used AI models, ChatGPT-4o and DeepSeek-R1, in addressing urticaria-related queries.

METHODS: An e-Delphi procedure was employed to generate and refine a set of urticaria-related questions, as well as to develop an evaluation framework for AI-generated responses. ChatGPT-4o and DeepSeek-R1 were then prompted with the finalized questions, and their responses were recorded. A single-blind comparative assessment was conducted among 67 participants (29 dermatologists and 38 non-dermatologists). The responses from both AI models were assessed across simplicity, accuracy, professionalism, clinical feasibility, comprehensibility, and completeness.

RESULTS: DeepSeek-R1 outperformed ChatGPT-4o in most metrics. Dermatologists rated DeepSeek-R1 significantly higher in simplicity (p < 0.001), accuracy (p < 0.001), completeness (p = 0.001), professionalism (p < 0.001), and clinical feasibility (p < 0.001). Non-dermatologists found DeepSeek-R1's responses more concise (p < 0.001) and comprehensible (p < 0.001). Both models showed comparable integration of cutting-edge knowledge (p = 0.06), though DeepSeek-R1 exhibited greater output stability, as evidenced by lower standard deviations. When compared with the guidelines, the answers provided by DeepSeek-R1 contained no errors, while ChatGPT-4o made errors in three clinical questions.

CONCLUSION: AI-generated answers require rigorous evaluation to ensure their reliability and suitability for medical applications. Based on the current study, DeepSeek-R1 outperforms ChatGPT-4o in addressing urticaria-related queries, demonstrating higher potential for both clinical and patient use.

PMID: 41306070 | PMC: PMC12658338 | DOI: 10.1002/clt2.70113
