Abstract

Objective: To evaluate the quality of the answers and the references provided by ChatGPT for medical questions.

Patients and Methods: Three researchers asked ChatGPT 20 medical questions and prompted it to provide the corresponding references. The responses were evaluated for quality of content by medical experts using a verbal numeric scale ranging from 0% to 100%. These experts were the corresponding authors of the 20 articles from which the medical questions were derived. We planned to evaluate 3 references per response for their pertinence, but this was amended on the basis of preliminary results showing that most references provided by ChatGPT were fabricated. This experimental observational study was conducted in February 2023.

Results: ChatGPT provided responses of 53 to 244 words and reported 2 to 7 references per answer. Seventeen of the 20 invited raters provided feedback. The raters reported limited quality of the responses, with a median score of 60% (first and third quartiles: 50% and 85%, respectively). In addition, they identified major (n=5) and minor (n=7) factual errors among the 17 evaluated responses. Of the 59 references evaluated, 41 (69%) were fabricated, although they appeared real. Most fabricated citations used the names of authors with previous relevant publications, a title that seemed pertinent, and a credible journal format.

Conclusion: When asked multiple medical questions, ChatGPT provided answers of limited quality for scientific publication. More importantly, it provided deceptively real references. Users of ChatGPT should pay particular attention to the references provided before integrating them into medical manuscripts.
