Abstract

LSTMs are powerful tools for modeling contextual information, as evidenced by their success at the task of language modeling. However, modeling contexts in very high dimensional space can lead to poor generalizability. We introduce the Pyramidal Recurrent Unit (PRU), which enables learning representations in high dimensional space with more generalization power and fewer parameters. PRUs replace the linear transformation in LSTMs with more sophisticated interactions such as pyramidal or grouped linear transformations. This architecture gives strong results on word-level language modeling while reducing parameters significantly. In particular, PRU improves the perplexity of a recent state-of-the-art language model by up to 1.3 points while learning 15-20% fewer parameters. For a similar number of model parameters, PRU outperforms all previous RNN models that exploit different gating mechanisms and transformations. We provide a detailed examination of the PRU and its behavior on language modeling tasks. Our code is open-source and available at https://sacmehta.github.io/PRU/.
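The grouped linear transformation mentioned above can be sketched in a few lines. The following is a minimal, illustrative PyTorch version, not the authors' released code (which is available at the URL above); the names `GroupedLinear`, `in_dim`, `out_dim`, and `groups` are chosen here for exposition. The input vector is split into `groups` chunks, each chunk is transformed with its own smaller weight matrix, and the results are concatenated, which divides the parameter count of the transformation by `groups`. The sketch covers only the grouped variant, not the full pyramidal transformation.

```python
import torch
import torch.nn as nn

class GroupedLinear(nn.Module):
    """Illustrative grouped linear transformation: split the input into
    `groups` chunks and transform each chunk with its own smaller weight,
    reducing parameters from in_dim*out_dim to (in_dim*out_dim)/groups."""
    def __init__(self, in_dim, out_dim, groups):
        super().__init__()
        assert in_dim % groups == 0 and out_dim % groups == 0
        self.groups = groups
        # one (in_dim/groups) x (out_dim/groups) weight matrix per group
        self.weight = nn.Parameter(
            0.01 * torch.randn(groups, in_dim // groups, out_dim // groups))

    def forward(self, x):                       # x: (batch, in_dim)
        b = x.size(0)
        x = x.view(b, self.groups, -1)          # (batch, groups, in_dim/groups)
        # independent linear transform per group, then concatenate
        y = torch.einsum('bgi,gio->bgo', x, self.weight)
        return y.reshape(b, -1)                 # (batch, out_dim)
```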

Highlights

  • Long short term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) are popular for many sequence modeling tasks and are used extensively in language modeling

  • The Pyramidal Recurrent Unit (PRU) achieves the best performance among the compared models while learning fewer parameters

  • When we evaluate PRU-based language models with dynamic evaluation on the Penn Treebank (PTB) test set, the perplexity of PRU (g = 4, k = 2, M = 1400) improves from 62.42 to 55.23, while the perplexity of an LSTM (M = 1000) with similar settings improves from 66.29 to 58.79, suggesting that modern inference techniques are applicable to PRU-based language models


Summary

Introduction

Long short term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) are popular for many sequence modeling tasks and are used extensively in language modeling. Despite the sophistication of the gating mechanisms employed in LSTMs and similar recurrent units, the input and context vectors are treated with simple linear transformations prior to gating. Non-linear transformations such as convolutions (Kim et al., 2016) have been used, but these have not achieved the performance of well-regularized LSTMs for language modeling (Melis et al., 2018). A natural way to improve the expressiveness of linear transformations is to increase the number of dimensions of the input and context vectors, but this comes with a significant increase in the number of parameters, which may limit generalizability. For example, LSTM performance degrades as the dimensionality of the input and context vectors increases. The semantics of the input and context vectors are different, suggesting that each may benefit from specialized treatment.
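To make the parameter growth concrete, the short sketch below counts the weights in a standard LSTM cell (four gates, each with an input-to-hidden matrix, a hidden-to-hidden matrix, and a bias). The sizes used are hypothetical and chosen only to show that the hidden-to-hidden weights grow quadratically with the dimensionality of the context vector.

```python
def lstm_param_count(input_dim, hidden_dim):
    """Parameters in a standard LSTM cell: four gates, each with an
    input-to-hidden matrix, a hidden-to-hidden matrix, and a bias."""
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)

# Hypothetical sizes: each doubling of the hidden dimension quadruples the
# hidden-to-hidden weights, illustrating the growth discussed above.
for h in (400, 800, 1600):
    print(h, lstm_param_count(400, h))
```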

