Abstract

Prior work has explored directly regularizing the output distributions of probabilistic models to alleviate peaky (i.e. over-confident) predictions, a common sign of overfitting. This class of techniques, of which label smoothing is one, has a connection to entropy regularization. Despite the consistent success of label smoothing across architectures and data sets in language generation tasks, two problems remain open: (1) there is little understanding of the underlying effects entropy regularizers have on models, and (2) the full space of entropy regularization techniques is largely unexplored. We introduce a parametric family of entropy regularizers, which includes label smoothing as a special case, and use it to gain a better understanding of the relationship between the entropy of a model and its performance on language generation tasks. We also find that variance in model performance can be explained largely by the resulting entropy of the model. Lastly, we find that label smoothing provably does not allow for sparsity in an output distribution, an undesirable property for language generation models, and therefore advise the use of other entropy regularization methods in its place.
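
To make the final claim concrete, here is a one-step sketch (in our own notation; the paper's formal argument may differ) of why label smoothing rules out sparsity. With u the uniform distribution over a vocabulary V, label smoothing adds the penalty

```latex
D_{\mathrm{KL}}\!\left(u \,\middle\|\, p_\theta\right)
  = \sum_{i \in V} \frac{1}{|V|}\,\log\frac{1/|V|}{p_\theta(i)},
```

which diverges as any p_theta(i) approaches 0. Any finite-loss solution must therefore assign nonzero probability to every word in the vocabulary, so the output distribution can never be sparse.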

Highlights

  • When training large neural networks with millions of parameters, regularization of some form is needed to prevent overfitting, even when large amounts of data are used; models for language generation are no exception

  • We introduce generalized entropy regularization (GER) and use it to examine the relationship between entropy and evaluation metrics on two language generation tasks: neural machine translation (NMT) and abstractive summarization

Introduction

When training large neural networks with millions of parameters, regularization of some form is needed to prevent overfitting, even when large amounts of data are used; models for language generation are no exception. For example, when the final layer of the neural network is a softmax, overfitting often manifests itself in overconfident placement of most of the probability mass on a few candidates, resulting in peaky (low-entropy) probability distributions over the vocabulary. Despite the clear relationship between low entropy and overfitting, only a handful of distinct entropy regularizers have been explored. To fill this gap, we introduce generalized entropy regularization (GER), a unified framework for understanding and exploring a broad range of entropy-inducing regularizers. We use GER to examine the relationship between entropy and evaluation metrics on two language generation tasks: neural machine translation (NMT) and abstractive summarization.
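
To make the two best-known members of this space concrete, the sketch below implements both as drop-in training losses: label smoothing, which adds KL(u || p_theta) against the uniform distribution u, and the confidence penalty, which adds KL(p_theta || u) (equivalently, subtracts the entropy of the model's predictions). This is a minimal PyTorch illustration, not code from the paper; the function names and the weights epsilon and gamma are our own, and we assume logits of shape (batch, vocab) with integer target indices.

```python
import torch
import torch.nn.functional as F


def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy against smoothed targets (1 - epsilon) * one-hot + epsilon * u.

    Up to an additive constant, this equals NLL + epsilon * KL(u || p_theta),
    where u is the uniform distribution over the vocabulary.
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # (batch, vocab)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform_ce = -log_probs.mean(dim=-1)                           # cross-entropy of p_theta against u
    return ((1.0 - epsilon) * nll + epsilon * uniform_ce).mean()


def confidence_penalty_loss(logits, targets, gamma=0.1):
    """NLL minus gamma times the entropy of the model's predictions.

    Up to an additive constant, this equals NLL + gamma * KL(p_theta || u),
    so the penalty explicitly rewards higher-entropy (less peaky) outputs.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)           # H(p_theta), per example
    return (nll - gamma * entropy).mean()
```

Note the asymmetry in the direction of the KL divergence: as sketched after the abstract, KL(u || p_theta) is infinite whenever the model assigns zero probability to any word, so a label-smoothed model can never produce a sparse output distribution, while the confidence penalty's KL(p_theta || u) imposes no such barrier.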
