Abstract

Massive volumes of textual content have enabled rapid advances in natural language modeling. The use of pre-trained deep neural language models has significantly improved natural language understanding tasks. However, the extent to which these systems can be applied to content generation is unclear. While a few informal studies have claimed that these models can generate ‘high quality’ readable content, there is no prior study analyzing the content generated by these models as a function of sampling and fine-tuning hyperparameters. We conduct an in-depth comparison of several language models for open-ended story generation from given prompts. Using a diverse set of automated metrics, we compare the performance of transformer-based generative models – OpenAI’s GPT2 (pre-trained and fine-tuned) and Google’s pre-trained Transformer-XL and XLNet – to human-written textual references. Studying inter-metric correlation along with metric rankings reveals interesting insights, such as the high correlation between readability scores and word usage in the text. Statistical significance tests and empirical evaluations of the scores (human- and machine-generated) at higher sampling hyperparameter combinations ($t = \{0.75, 1.0\}$, $k = \{100, 150, 250\}$) reveal that the top pre-trained and fine-tuned models generate samples that condition well on the prompt, with an increased occurrence of unique and difficult words. The GPT2-medium model fine-tuned on the 1024-token Byte-Pair Encoding (BPE) tokenized version of the dataset, along with the pre-trained Transformer-XL model, generated samples closest to human-written content on three metrics: prompt-based overlap, coherence, and variation in sentence length. A study of overall model stability and performance shows that fine-tuned GPT2 language models deviate least from human performance in metric scores.
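
For illustration, the sampling setup described above (temperature $t$ and top-$k$ decoding) can be reproduced with the Hugging Face transformers library. The following is a minimal sketch, not the paper’s exact pipeline: the checkpoint name, prompt, and output length are placeholder assumptions, while the $t$ and $k$ values are one of the hyperparameter combinations from the abstract.

```python
# Minimal sketch: top-k sampling with temperature for prompt-conditioned
# generation, using the Hugging Face `transformers` library.
# Model name, prompt, and max_length are illustrative assumptions.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

prompt = "The lighthouse keeper saw a second light across the bay."  # placeholder prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# One hyperparameter combination studied in the abstract: t = 0.75, k = 100.
output = model.generate(
    input_ids,
    do_sample=True,     # stochastic sampling instead of greedy decoding
    temperature=0.75,   # t < 1 sharpens the next-token distribution
    top_k=100,          # k restricts sampling to the 100 most probable tokens
    max_length=256,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```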

Highlights

  • Natural language generation has gained popularity with new language resources and language models, which can be used to emulate the stylistic aspects of the training dataset

  • The experiments and statistical significance results indicate that samples generated by the pre-trained models openai-gpt, transfo-xl, and xlnet-large are closest to the human-written samples in Flesch Reading Ease (FRE) scores (a minimal FRE computation is sketched after this list)

  • Contrary to a previous study of automated metrics [27], FRE does not capture the linguistic quality of the generated content well
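
For reference, the FRE score mentioned above is the standard Flesch Reading Ease readability formula. Below is a minimal, self-contained sketch; the syllable counter is a rough heuristic introduced here for illustration, not the implementation used in the paper.

```python
import re

def count_syllables(word: str) -> int:
    # Heuristic syllable count: number of vowel groups, with a common
    # adjustment for a silent trailing 'e'. Approximate, for illustration only.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text: str) -> float:
    # Standard FRE formula:
    # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / max(len(sentences), 1))
            - 84.6 * (syllables / max(len(words), 1)))

print(flesch_reading_ease("The cat sat on the mat. It purred softly."))
```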

Introduction

Natural language generation has gained popularity with new language resources and language models, which can be used to emulate the stylistic aspects of the training dataset. Researchers have proposed several novel architectures capable of modeling robust representations of natural language – recurrent neural networks (RNNs) [47], sequential encoder-decoders (sequence-to-sequence learning) [8], generative adversarial networks (GANs) [9], and transformers with attention modeling [51]. The use of large-scale neural language models trained on massive volumes of textual content has emerged as a solution to many natural language tasks. Available pre-trained models, such as OpenAI’s GPT [35], [36], AllenNLP’s ELMo [31], Google’s BERT [7], and Google/CMU’s XLNet [55], have improved performance on natural language understanding tasks considerably.
