It is unknown whether ChatGPT provides high-quality responses to infectious diseases (ID) pharmacotherapy questions. This study surveyed ID pharmacist subject matter experts (SMEs) to assess the quality of ChatGPT version 3.5 (GPT-3.5) responses. The primary outcome was the percentage of GPT-3.5 responses considered useful by SME rating. Secondary outcomes were SMEs' ratings of correctness, completeness, and safety. Rating definitions were based on a literature review. One hundred ID pharmacotherapy questions were entered into GPT-3.5 without custom instructions or additional prompts, and the responses were recorded. A 0-10 rating scale for correctness, completeness, and safety was developed and validated for interrater reliability. Interrater reliability was assessed via the average-measures intraclass correlation coefficient (ICC) for continuous variables and the Fleiss multirater kappa for categorical variables. SMEs' ratings were compared with the Kruskal-Wallis test for continuous variables and the chi-square test for categorical variables. SMEs considered 41.8% of responses useful. Median (IQR) ratings for correctness, completeness, and safety were 7 (4-9), 5 (3-8), and 8 (4-10), respectively. The Fleiss multirater kappa for usefulness was 0.379 (95% CI, 0.317-0.441), indicating fair agreement, and the ICCs were 0.820 (95% CI, 0.758-0.870), 0.745 (95% CI, 0.656-0.816), and 0.833 (95% CI, 0.775-0.880) for correctness, completeness, and safety, respectively, indicating at least substantial agreement. No significant difference was observed among SMEs in the percentage of responses considered useful. Fewer than 50% of GPT-3.5 responses were considered useful by SMEs. Responses were mostly rated correct and safe but were often incomplete, suggesting that GPT-3.5 responses may not replace those of an ID pharmacist.
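For readers who want to run the same reliability statistics on their own rating data, the sketch below shows one way to compute the two measures named in the abstract (Fleiss multirater kappa for the binary usefulness rating, and an average-measures ICC for the 0-10 scales) in Python using statsmodels and pingouin. This is an illustration under stated assumptions, not the study's analysis code: the ratings are random placeholders, and the number of raters (six) is hypothetical, as the abstract does not report the SME count.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(42)

# Fleiss multirater kappa for the categorical "useful" rating.
# Rows = 100 questions, columns = raters (6 is a placeholder; the abstract
# does not report the number of SMEs). Entries: 0 = not useful, 1 = useful.
useful = rng.integers(0, 2, size=(100, 6))
table, _ = aggregate_raters(useful)  # per-question counts of each category
print("Fleiss kappa:", fleiss_kappa(table, method="fleiss"))

# Average-measures ICC for a continuous 0-10 rating (e.g., correctness).
# pingouin expects long format: one row per (question, rater) pair.
scores = rng.integers(0, 11, size=(100, 6))
long = pd.DataFrame({
    "question": np.repeat(np.arange(100), 6),
    "rater": np.tile(np.arange(6), 100),
    "score": scores.ravel(),
})
icc = pg.intraclass_corr(data=long, targets="question",
                         raters="rater", ratings="score")
# ICC2k is the two-way random-effects, average-measures estimate.
print(icc.set_index("Type").loc["ICC2k", ["ICC", "CI95%"]])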