Abstract

Artificial intelligence (AI)-based tools can reshape healthcare practice, and ChatGPT is among the most popular AI-based conversational models. Nevertheless, the performance of different ChatGPT versions needs further evaluation in diverse settings to assess the model's reliability and credibility in various healthcare-related tasks. Therefore, the current study aimed to assess the performance of the freely available ChatGPT-3.5 and the paid version ChatGPT-4 in 10 diagnostic clinical microbiology case scenarios. The study followed the METRICS checklist (Model, Evaluation, Timing/Transparency, Range/Randomization, Individual factors, Count, Specificity of the prompts/language) for standardized design and reporting of AI-based studies in healthcare. The models, tested on December 3, 2023, were ChatGPT-3.5 and ChatGPT-4, and the ChatGPT-generated content was evaluated with the CLEAR tool (Completeness, Lack of false information, Evidence support, Appropriateness, and Relevance), with each item rated on a 5-point Likert scale (CLEAR scores of 1-5). ChatGPT output was evaluated independently by two raters, and inter-rater agreement was assessed with Cohen's κ statistic. Ten diagnostic clinical microbiology laboratory case scenarios were created in English by three microbiologists at different levels of expertise, following an internal discussion of common cases observed in Jordan. The topics covered bacteriology, mycology, parasitology, and virology. Specific prompts were tailored based on the CLEAR tool, and a new session was started for each case scenario. Cohen's κ values for the five CLEAR items were 0.351-0.737 for ChatGPT-3.5 and 0.294-0.701 for ChatGPT-4, indicating fair to good agreement and suitability for analysis. Based on the average CLEAR scores, ChatGPT-4 outperformed ChatGPT-3.5 (mean: 2.64±1.06 for ChatGPT-3.5 vs. 3.21±1.05 for ChatGPT-4, P=.012, t-test). The performance of each model varied across the CLEAR items, with the lowest scores for the "Relevance" item (2.15±0.71 for ChatGPT-3.5 and 2.65±1.16 for ChatGPT-4). A statistically significant difference across the CLEAR items was seen only for ChatGPT-4, with the best performance in "Completeness", "Lack of false information", and "Evidence support" (P=.043). For both models, the lowest performance was observed for antimicrobial susceptibility testing (AST) queries, while the highest performance was seen in bacterial and mycologic identification. The assessment of ChatGPT performance across different diagnostic clinical microbiology case scenarios showed that ChatGPT-4 outperformed ChatGPT-3.5. ChatGPT performance varied noticeably depending on the specific topic evaluated. A primary shortcoming of both models was a tendency to generate irrelevant content lacking the needed focus. Although the overall ChatGPT performance in these diagnostic microbiology case scenarios might be described as "above average" at best, there remains significant room for improvement, considering the identified limitations and the unsatisfactory results in a few cases.
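To illustrate the statistical approach summarized above (inter-rater agreement via Cohen's κ and comparison of mean CLEAR scores via a t-test), the following is a minimal Python sketch using hypothetical rater scores; the values and variable names are illustrative assumptions and do not reproduce the study's data.

```python
# Minimal sketch of the analyses described in the abstract, using
# hypothetical CLEAR ratings (1-5 Likert scale); not the study's data.
from scipy.stats import ttest_ind
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings by two independent raters for one CLEAR item
# across 10 case scenarios answered by ChatGPT-3.5.
rater1_gpt35 = [3, 2, 4, 2, 3, 1, 2, 3, 4, 2]
rater2_gpt35 = [3, 2, 3, 2, 3, 2, 2, 3, 4, 2]

# Inter-rater agreement for that item (Cohen's kappa).
kappa = cohen_kappa_score(rater1_gpt35, rater2_gpt35)
print(f"Cohen's kappa: {kappa:.3f}")

# Hypothetical per-scenario CLEAR scores averaged across items for each model.
gpt35_scores = [2.6, 2.2, 3.0, 2.4, 2.8, 1.8, 2.4, 2.8, 3.2, 2.2]
gpt4_scores = [3.4, 3.0, 3.6, 2.8, 3.2, 2.6, 3.0, 3.4, 3.8, 3.0]

# Compare the two models' mean CLEAR scores with a t-test.
t_stat, p_value = ttest_ind(gpt4_scores, gpt35_scores)
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```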
