Abstract

Introduction
Testosterone has become an increasing cultural touchstone. It is estimated that one-third of men on testosterone replacement are not testosterone deficient and that a large percentage of men who would qualify for treatment do not receive it. This is likely due in part to the significant amount of misinformation and disinformation available online. Recently, there has been an explosion of interest in generative artificial intelligence (AI). Given the sensitive nature of questions regarding sexual health, many men may see generative AI as an anonymous, confidential, and accurate source of information. Specifically, OpenAI's ChatGPT is the most widely used generative AI and offers an easy-to-navigate, conversation-based resource where patients may inquire about many topics, including sexual health concerns. This study aims to validate ChatGPT responses to questions about testosterone and to evaluate the references used to support its claims.

Objective
To evaluate the accuracy of ChatGPT responses to common questions about testosterone therapy and the sources substantiating its claims.

Methods
Questions regarding testosterone function and therapy were created based on American Urological Association (AUA) Care Foundation brochures for patient education on testosterone. ChatGPT was prompted to respond to these questions during a single interactive session, with each prompt ending in the phrase "include references" to evaluate the sources of the responses. Responses were then separated into individual statements made by the AI and their associated references. Each statement was evaluated for accuracy by matching the AI-generated statements against AUA guidelines and published AUA resources. References were evaluated for the accuracy of the authors, article, journal, and date of publication using the PubMed index, Google Scholar, and the cited journal's website.

Results
A total of 9 questions were posed to the generative AI. The interactive session with ChatGPT yielded 53 separate statements backed by 21 references. The majority of statements were true when matched against AUA guidelines (50/53, 94.3%). The references provided by the generative AI were drawn primarily from peer-reviewed journals (15/21, 71.4%), followed by hospital websites (3/21, 14.3%), unknown sources (2/21, 9.5%), and AUA guidelines (1/21, 4.8%). The mean impact factor of the cited peer-reviewed journals was 27.2 (SD 60.5). A majority of references were found to be legitimate sources (13/21, 61.9%). However, 4 (19.0%) contained citation errors, and 4 (19.0%) were fabricated by the AI. The question yielding the most fabricated references (3/4, 75%) was "How do I boost my testosterone levels naturally?"

Conclusions
We found that ChatGPT offers a reliable, conversation-based resource where patients may receive accurate information in response to testosterone-related questions. However, when queried about topics with fewer available references, the large language model fabricated sources attributed to real, reputable authors and real journals. This is highly concerning, as the general public would not be able to differentiate between real and fabricated information. Patients should be counseled with caution, and guardrails should be put in place for the safety of patients using large language models for medical information.

Disclosure
No.