Abstract

The release of openly available, robust natural language generation (NLG) algorithms has spurred much public attention and debate. One reason lies in the algorithms' purported ability to generate human-like text across various domains. Empirical evidence using incentivized tasks to assess whether people (a) can distinguish and (b) prefer algorithm-generated versus human-written text is lacking. We conducted two experiments assessing behavioral reactions to the state-of-the-art natural language generation algorithm GPT-2 (total N = 830). Using the identical starting lines of human poems, GPT-2 produced samples of poems. From these samples, either a random poem was chosen (Human-out-of-the-loop) or the best one was selected (Human-in-the-loop) and in turn matched with a human-written poem. In a new incentivized version of the Turing Test, participants failed to reliably detect the algorithmically generated poems in the Human-in-the-loop treatment, yet succeeded in the Human-out-of-the-loop treatment. Further, people revealed a slight aversion to algorithm-generated poetry, independent of whether they were informed about the algorithmic origin of the poem (Transparency) or not (Opacity). We discuss what these results convey about the ability of NLG algorithms to produce human-like text and propose methodologies for studying such learning algorithms in human-agent experimental settings.

Highlights

  • The release of openly available, robust natural language generation (NLG) algorithms has spurred much public attention and debate

  • Do people prefer creative text written by fellow humans to that generated by algorithms? Does the information about the respective origin – being a human or an algorithm – sway this preference for the creative-text output? Does human involvement in the selection of the text output generated by the algorithm make a difference? To address these questions, we use incentivized paradigms to extend previous research into …

  • In Study 2, we introduced a between-subjects manipulation of selection procedure, namely, whether the poems entering the competition were again selected by the authors, that is, HITL, versus randomly sampled from the outputs that Generative Pre-Training 2 (GPT-2) produced, that is, HOTL

Introduction

The release of openly available, robust natural language generation (NLG) algorithms has spurred much public attention and debate. Empirical evidence using incentivized tasks to assess whether people (a) can distinguish and (b) prefer algorithm-generated versus human-written text is lacking. People revealed a slight aversion to algorithm-generated poetry, independent of whether participants were informed about the algorithmic origin of the poem (Transparency) or not (Opacity). We discuss what these results convey about the ability of NLG algorithms to produce human-like text and propose methodologies for studying such learning algorithms in human-agent experimental settings. The text outputs adhere to grammatical and semantic rules and allegedly reach human levels. Due to such claims about the algorithm's unprecedented abilities and the potential ethical challenges it raises, for example, as a tool for disinformation (Kreps & McCain, 2019), much controversy accompanied the algorithm's release (The Guardian, 2019). Much research has focused on the machinery, that is, how to design algorithms to write creative text (Boden, 2009; Gonçalo Oliveira, 2009; Gonçalo Oliveira, 2018), whereas research on humans' behavioral reactions to such algorithms is much less developed.
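
The generation-and-selection procedure described above (GPT-2 continues the opening line of a human-written poem, and either a random sample is kept or a person picks the best one) can be illustrated with a short script. The sketch below is an assumption-laden illustration only, not the authors' actual pipeline: it uses the publicly released GPT-2 weights through the Hugging Face transformers library, and the prompt, sampling parameters, and number of candidates are placeholders chosen for the example.

import random

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the publicly released GPT-2 model and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Illustrative prompt: the opening line of a human-written poem.
opening_line = "Do not stand at my grave and weep"
inputs = tokenizer(opening_line, return_tensors="pt")

# Sample several candidate continuations from the model.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,
    max_length=120,
    num_return_sequences=10,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Human-out-of-the-loop (HOTL): keep one candidate at random.
hotl_poem = random.choice(candidates)
print("HOTL pick:\n", hotl_poem)

# Human-in-the-loop (HITL): print all candidates so a person can choose the best one;
# the human judgment itself cannot be automated and is only indicated here.
for i, poem in enumerate(candidates):
    print(f"--- candidate {i} ---\n{poem}\n")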
