Abstract

Introduction: Large Language Model (LLM) applications in medicine are increasing. The chatbots ChatGPT and GPT-4 have been tested against medical examinations. Vetting these chatbots as a source of information for both the patient and the physician is needed. A credible source of medical information may be used as the gold standard against which the LLM can be tested.

Objective: We assess the performance of ChatGPT and GPT-4 in answering the American Urological Association (AUA) Self-Assessment Study Program (SASP) questions on male sexual dysfunction (MSD), female sexual dysfunction (FSD), sexually transmitted infections (STI), and male factor infertility (MFI). We aim to determine how credible this LLM is as a source of medical advice.

Methods: Four registered users of the SASP identified the questions using open-book mode in tests from 2019 to 2023, spanning their subscriptions. The questions were ranked for difficulty on a five-point Likert scale. OpenAI ChatGPT 3.5 and GPT-4 were used to answer the questions, with chat history turned off. No plug-ins or feedback were permitted. Prompts were generated from the question stems, masking the question ID and adding the phrase "I am a urologist preparing for my board exam. Please answer the following question:" Images were deleted from questions and substituted with a brief description. All questions were answered separately, first using GPT-4 and then ChatGPT. Three consecutive responses were generated for each prompt, and the consensus answer was tallied. The chatbots' answers were compared with the SASP answer key. Descriptive statistics, Pearson's chi-square, and Fisher's exact tests were applied.

Results: We identified 115 questions in the domains of sexual dysfunction, STI, and MFI. Only one question had an associated image. GPT-4 performed better than ChatGPT in all domains, providing 60% correct answers versus 40% (p<0.001). A total of 89.9% of GPT-4's correct answers were unanimously regenerated across the three responses (p=0.007), compared with 67.4% for ChatGPT (p=0.244). Within each domain, there was no significant difference in correct answers for either chatbot (Table-1) or across test years (p=0.682). The SASP answer references were Campbell's Urology alone in 38.3%, AUA guidelines alone in 17.4%, the AUA core curriculum alone in 10.4%, and combinations with other sources in 33.9%. There was no significant association of the source reference with the correct answers of GPT-4 (p=0.058) or ChatGPT (p=0.451). Only 19.1% of sources were open access, 20% were partially open access, and 60.9% were restricted to subscribers. The availability of the source did not significantly affect the correct answers of GPT-4 (p=0.272) or ChatGPT (p=0.231). Both chatbots' correct answers were significantly associated with easier questions (Table-2).

Conclusions: The LLMs tested have average accuracy as a source of credible medical information on sexual dysfunction, STI, and MFI. GPT-4 performs better than ChatGPT, especially when the regenerated responses are unanimous. Developing a better model to serve a broader group of physicians and patients will require training the chatbot on credible urology literature, including the roughly 81% of sources that are not fully open access.

Disclosure: None.
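For illustration only, the consensus-of-three protocol described in the Methods and the overall accuracy comparison can be sketched in a short Python script. This is a minimal sketch under stated assumptions, not the authors' code: the study queried the ChatGPT web interface with chat history disabled, whereas the sketch calls the OpenAI API, and the model names and the 69/46 correct-answer counts are assumptions reconstructed from the reported 60% versus 40% of 115 questions.

```python
# Illustrative sketch only -- not the study workflow. Model names, helper
# names, and the reconstructed counts below are assumptions.
from collections import Counter

from openai import OpenAI
from scipy.stats import fisher_exact

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_PREFIX = ("I am a urologist preparing for my board exam. "
                 "Please answer the following question:\n\n")

def consensus_answer(question_stem: str, model: str, n_runs: int = 3) -> str:
    """Generate three independent responses and return the most frequent one.

    In practice the chosen option letter would be extracted from each
    response before tallying; raw text is compared here for brevity.
    """
    answers = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model=model,  # e.g. "gpt-4" or "gpt-3.5-turbo" (assumed names)
            messages=[{"role": "user",
                       "content": PROMPT_PREFIX + question_stem}],
        )
        answers.append(response.choices[0].message.content.strip())
    return Counter(answers).most_common(1)[0][0]

# Overall accuracy comparison: counts approximated from the reported
# 60% vs. 40% correct out of 115 questions.
table = [[69, 115 - 69],   # GPT-4: correct, incorrect
         [46, 115 - 46]]   # ChatGPT: correct, incorrect
_, p_value = fisher_exact(table)
print(f"Fisher's exact test p = {p_value:.4f}")
```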