The great detectives: humans versus AI detectors in catching large language model-generated medical writing

Jae Q J Liu, Zing Z X Zhou, Dino Samartzis, Fadi Al Zoubi, Curtis C H Yu, Kelvin T K Hui, Jeremy R Chang, Arnold Y L Wong

doi:10.1007/s40979-024-00155-6

Open Access

https://doi.org/10.1007/s40979-024-00155-6

Journal: International Journal for Educational Integrity
Publication Date: May 20, 2024
Citations: 4
License type: CC BY 4.0

Abstract

Background

The application of artificial intelligence (AI) in academic writing has raised concerns regarding accuracy, ethics, and scientific rigour. Some AI content detectors may not accurately identify AI-generated texts, especially those that have undergone paraphrasing. Therefore, there is a pressing need for efficacious approaches or guidelines to govern AI usage in specific disciplines.

Objective

Our study aims to compare the accuracy of mainstream AI content detectors and human reviewers in detecting AI-generated rehabilitation-related articles with or without paraphrasing.

Study design

This cross-sectional study purposively chose 50 rehabilitation-related articles from four peer-reviewed journals, and then fabricated another 50 articles using ChatGPT. Specifically, ChatGPT was used to generate the introduction, discussion, and conclusion sections based on the original titles, methods, and results. Wordtune was then used to rephrase the ChatGPT-generated articles. Six common AI content detectors (Originality.ai, Turnitin, ZeroGPT, GPTZero, Content at Scale, and GPT-2 Output Detector) were employed to identify AI content in the original, ChatGPT-generated, and AI-rephrased articles. Four human reviewers (two student reviewers and two professorial reviewers) were recruited to differentiate between the original articles and the AI-rephrased articles, which were expected to be more difficult to detect. They were instructed to give reasons for their judgements.

Results

Originality.ai correctly detected 100% of ChatGPT-generated and AI-rephrased texts. ZeroGPT accurately detected 96% of ChatGPT-generated and 88% of AI-rephrased articles. The area under the receiver operating characteristic curve (AUROC) of ZeroGPT was 0.98 for identifying human-written and AI articles. Turnitin showed a 0% misclassification rate for human-written articles, although it identified only 30% of AI-rephrased articles. Professorial reviewers accurately discriminated at least 96% of AI-rephrased articles, but they misclassified 12% of human-written articles as AI-generated. On average, students identified only 76% of AI-rephrased articles. Reviewers identified AI-rephrased articles based on ‘incoherent content’ (34.36%), followed by ‘grammatical errors’ (20.26%), and ‘insufficient evidence’ (16.15%).

Conclusions and relevance

This study directly compared the accuracy of advanced AI detectors and human reviewers in detecting AI-generated medical writing after paraphrasing. Our findings demonstrate that specific detectors and experienced reviewers can accurately identify articles generated by Large Language Models, even after paraphrasing. The rationale employed by our reviewers in their assessments can inform future evaluation strategies for monitoring AI usage in medical education or publications. AI content detectors may be incorporated as an additional screening tool in the peer-review process of academic journals.
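The Results report three kinds of figures: the share of AI articles a detector catches, the share of human-written articles it wrongly flags, and an AUROC. As a minimal sketch of how such figures are commonly computed, the Python below uses a small, entirely hypothetical data set; the function names, the scores, and the 0.5 flagging threshold are illustrative assumptions and are not taken from the study.

```python
def detection_rates(labels, flagged):
    """Return (detection rate on AI articles, misclassification rate on human articles).

    labels:  1 = AI-generated, 0 = human-written
    flagged: 1 = detector judged the article to be AI, 0 = judged human
    """
    ai = [f for l, f in zip(labels, flagged) if l == 1]
    human = [f for l, f in zip(labels, flagged) if l == 0]
    return sum(ai) / len(ai), sum(human) / len(human)


def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen AI article receives a
    higher score than a randomly chosen human article (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four AI articles followed by four human articles.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.40, 0.20, 0.10, 0.05]  # made-up detector scores
flagged = [1 if s >= 0.5 else 0 for s in scores]           # assumed 0.5 threshold

sensitivity, false_positive_rate = detection_rates(labels, flagged)
print(sensitivity, false_positive_rate, auroc(labels, scores))
```

On this toy data the detector catches three of the four AI articles and flags no human article. The two rates depend on the chosen threshold, whereas the AUROC summarises how well the scores rank AI articles above human ones across all thresholds.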

