Abstract

As if 2020 was not a peculiar enough year, its fifth month saw the relatively quiet publication of a preprint describing the most powerful natural language processing (NLP) system to date—GPT-3 (Generative Pre-trained Transformer-3)—created by the Silicon Valley research firm OpenAI. Though the software implementation of GPT-3 is still in its initial beta release phase, and its full capabilities are still unknown as of the time of this writing, it has been shown that this artificial intelligence can comprehend prompts in natural language, on virtually any topic, and generate relevant original text content that is indistinguishable from human writing. Moreover, access to these capabilities, in a limited yet worrisome enough extent, is available to the general public. This paper presents examples of original content generated by the author using GPT-3. These examples illustrate some of the capabilities of GPT-3 in comprehending prompts in natural language and generating convincing content in response. I use these examples to raise specific fundamental questions pertaining to the intellectual property of this content and the potential use of GPT-3 to facilitate plagiarism. The goal is to instigate a sense of urgency, as well as a sense of present tardiness on the part of the academic community in addressing these questions.

Highlights

  • The field of natural language processing (NLP) has come a long way since Chomsky’s work on formal grammars in the late 1950s–early 1960s (Chomsky 1959, 1965) gave rise to early mathematical and computational investigations of grammars (Joshi 1991)

  • This paper focuses on the potential of GPT-3 to facilitate academic misconduct, defined as the ‘fabrication, falsification, or plagiarism in proposing, performing or reviewing research, or in reporting research results’ (Juyal et al 2015, p. 77)

  • In the same way in which public domain texts and publicly funded research are seen as belonging to the public (Pierce & Theodossiou 2018), a case could possibly be made for the text generated by GPT-3 to be treated similarly, provided that the human authors of said text disclose the use of the software, along with the prompts and additional training data submitted to it


Summary

INTRODUCTION

Ethics Sci Environ Polit 21: 17–23, 2021

The field of natural language processing (NLP) has come a long way since Chomsky’s work on formal grammars in the late 1950s–early 1960s (Chomsky 1959, 1965) gave rise to early mathematical and computational investigations of grammars (Joshi 1991). Though still far from possessing artificial general intelligence (AGI), GPT-3 (Generative Pre-trained Transformer-3) represents an important breakthrough in this regard. This NLP model was presented in a May 2020 arXiv preprint by Brown et al (2020). The model includes 175 billion parameters, one order of magnitude more than the second largest similar model to date, and its pre-training reportedly required an investment of $12 million. This innovation allowed Brown et al (2020) to generate samples of news articles that were indistinguishable, to human evaluators, from articles written by humans. It was reported that the software could write original computer code, retrieve and structure data, or generate financial statements when prompted only in natural language (Metz 2020). One of the initial users of GPT-3 is AI Dungeon, a text-based gaming service that allows users to generate artificial intelligence (AI)-powered virtual adventures. This paper focuses on the potential of GPT-3 to facilitate academic misconduct, defined as the ‘fabrication, falsification, or plagiarism in proposing, performing or reviewing research, or in reporting research results’ (Juyal et al 2015, p. 77).

LITERATURE REVIEW
EXAMPLES AND DISCUSSION
CONCLUSIONS