No consensus exists on performance standards for evaluating medical responses generated by generative artificial intelligence (AI). The purpose of this study was to assess the ability of Chat Generative Pre-trained Transformer (ChatGPT) to address medical questions about prostate cancer. A global online survey was conducted from April to June 2023 among >700 medical oncologists and urologists who treat patients with prostate cancer. Participants were unaware that the survey evaluated AI. In component 1, responses to nine questions were written independently by medical writers (MW; curated from medical websites) and by ChatGPT-4.0 (AI-generated from publicly available information). Respondents were blinded to the source of the responses and randomly exposed to both AI-generated and MW-curated responses; evaluation criteria and overall preference were recorded. Exploratory component 2 evaluated AI-generated responses to five complex questions with nuanced answers in the medical literature. Responses were evaluated on a 5-point Likert scale. Statistical significance was denoted by P < .05. In component 1, respondents (N = 602) consistently preferred the clarity of AI-generated responses over that of MW-curated responses for 7 of 9 questions (P < .05). Despite favoring AI-generated responses when blinded to their source, respondents considered medical websites a more credible source (52%-67%) than ChatGPT (14%). Respondents in component 2 (N = 98) also considered medical websites more credible than ChatGPT but rated AI-generated responses highly on all evaluation criteria, despite the nuanced answers in the medical literature. These findings provide insight into how clinicians rate AI-generated and MW-curated responses and offer evaluation criteria that can be used in future AI validation studies.