To trust or not to trust: evaluating the reliability and safety of AI responses to laryngeal cancer queries.

Magdalena Ostrowska, Jacek Banaszewski, Anitta Sisily Joseph, Katie Vaughan-Lane, Paulina Kacała, Maciej J Wróbel, Deborah Onolememen, Adam Ostrowski, Wioletta Pietruszewska

doi:10.1007/s00405-024-08643-8

Abstract

As online health information-seeking surges, concerns mount over the quality and safety of accessible content, since misinformation can lead to patient harm. On one hand, the emergence of Artificial Intelligence (AI) in healthcare could help prevent this; on the other hand, questions arise regarding the quality and safety of the medical information it provides. As laryngeal cancer is a prevalent head and neck malignancy, this study aims to evaluate the utility and safety of three large language models (LLMs) as sources of patient information about laryngeal cancer. A cross-sectional study was conducted using three LLMs (ChatGPT 3.5, ChatGPT 4.0, and Bard). A questionnaire comprising 36 inquiries about laryngeal cancer was categorised into diagnosis (11 questions), treatment (9 questions), novelties and upcoming treatments (4 questions), controversies (8 questions), and sources of information (4 questions). The reviewers comprised three groups (ENT specialists, junior physicians, and non-medical reviewers), who graded the responses. Each physician evaluated each question twice for each model, while non-medical reviewers evaluated it only once. All reviewers were blinded to the model type, and the question order was shuffled. Outcome evaluations were based on a safety score (1-3) and a Global Quality Score (GQS, 1-5). Results were compared between LLMs. The study included iterative assessments and statistical validations. Analysis revealed that ChatGPT 3.5 scored highest in both safety (mean: 2.70) and GQS (mean: 3.95). ChatGPT 4.0 and Bard had lower safety scores of 2.56 and 2.42, respectively, with corresponding quality scores of 3.65 and 3.38. Inter-rater reliability was consistent, with less than 3% discrepancy. About 4.2% of responses fell into the lowest safety category (1), particularly in the novelty category. Non-medical reviewers' quality assessments correlated moderately (r = 0.67) with response length.
LLMs can be valuable resources for patients seeking information on laryngeal cancer. ChatGPT 3.5 provided the most reliable and safe responses among the models evaluated.

Journal: European Archives of Oto-Rhino-Laryngology
Publication Date: Apr 23, 2024
Citations: 3
License type: CC BY 4.0
