Abstract

Massive volumes of textual content have enabled rapid advances in natural language modeling. The use of pre-trained deep neural language models has significantly improved natural language understanding tasks. However, the extent to which these systems can be applied to content generation is unclear. While a few informal studies have claimed that these models can generate ‘high quality’ readable content, there is no prior study analyzing the content generated by these models as a function of sampling and fine-tuning hyperparameters. We conduct an in-depth comparison of several language models for open-ended story generation from given prompts. Using a diverse set of automated metrics, we compare the performance of transformer-based generative models – OpenAI’s GPT2 (pre-trained and fine-tuned) and Google’s pre-trained Transformer-XL and XLNet – to human-written textual references. Studying inter-metric correlation along with metric rankings reveals interesting insights, such as the high correlation between readability scores and word usage in the text. Statistical significance tests and empirical evaluations of the scores (human- and machine-generated) at higher sampling hyperparameter combinations ($t = \{0.75, 1.0\}$, $k = \{100, 150, 250\}$) reveal that the top pre-trained and fine-tuned models generate samples that condition well on the prompt, with an increased occurrence of unique and difficult words. The GPT2-medium model fine-tuned on the 1024-token Byte-Pair Encoding (BPE) tokenized version of the dataset, along with the pre-trained Transformer-XL model, generated samples closest to human-written content on three metrics: prompt-based overlap, coherence, and variation in sentence length. A study of overall model stability and performance shows that fine-tuned GPT2 language models deviate least from human performance in metric scores.
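
For illustration, the sampling setup described above (temperature $t$ and top-$k$ decoding) can be reproduced with the Hugging Face transformers library. The following is a minimal sketch, not the paper’s exact pipeline: the checkpoint name, prompt, and output length are placeholder assumptions, while the $t$ and $k$ values are one of the hyperparameter combinations from the abstract.

```python
# Minimal sketch: top-k sampling with temperature for prompt-conditioned
# generation, using the Hugging Face `transformers` library.
# Model name, prompt, and max_length are illustrative assumptions.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

prompt = "The lighthouse keeper saw a second light across the bay."  # placeholder prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# One hyperparameter combination studied in the abstract: t = 0.75, k = 100.
output = model.generate(
    input_ids,
    do_sample=True,     # stochastic sampling instead of greedy decoding
    temperature=0.75,   # t < 1 sharpens the next-token distribution
    top_k=100,          # k restricts sampling to the 100 most probable tokens
    max_length=256,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```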

Highlights

  • Natural language generation has gained popularity with new language resources and language models, which can be used to emulate the stylistic aspects of the training dataset

  • The experiments and statistical significance results indicate that samples generated by the pre-trained models openai-gpt, transfo-xl, and xlnet-large are closest to the human-written samples in Flesch Reading Ease (FRE) scores (a minimal FRE computation is sketched after this list)

  • Contrary to a previous study of automated metrics [27], FRE does not capture the linguistic quality of the generated content well
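
For reference, the FRE score mentioned above is the standard Flesch Reading Ease readability formula. Below is a minimal, self-contained sketch; the syllable counter is a rough heuristic introduced here for illustration, not the implementation used in the paper.

```python
import re

def count_syllables(word: str) -> int:
    # Heuristic syllable count: number of vowel groups, with a common
    # adjustment for a silent trailing 'e'. Approximate, for illustration only.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text: str) -> float:
    # Standard FRE formula:
    # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / max(len(sentences), 1))
            - 84.6 * (syllables / max(len(words), 1)))

print(flesch_reading_ease("The cat sat on the mat. It purred softly."))
```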

Introduction

Natural language generation has gained popularity with new language resources and language models, which can be used to emulate the stylistic aspects of the training dataset. Researchers have proposed several novel architectures capable of modeling robust representations of natural language – recurrent neural networks (RNNs) [47], sequential encoder-decoders (sequence-to-sequence learning) [8], generative adversarial networks (GANs) [9], and transformers with attention modeling [51]. The use of large-scale neural language models trained on massive volumes of textual content has emerged as a solution to many natural language tasks. Available pre-trained models, such as OpenAI’s GPT [35], [36], AllenNLP’s ELMo [31], Google’s BERT [7], and Google/CMU’s XLNet [55], have improved performance on natural language understanding tasks considerably.
