Abstract
Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5–9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model (Vaswani et al., 2017) on translation by incorporating SRU into the architecture.
Highlights
On the movie review (MR) dataset for instance, Simple Recurrent Unit (SRU) completes 100 training epochs within 40 seconds, while Long Short-term Memory (LSTM) takes over 320 seconds
SRU exhibits over 5x speed-up over LSTM and 53–63% reduction in total training time
Our 5-layer model obtains an average improvement of 0.7 test BLEU score, and an improvement of 0.5 BLEU score when comparing the best results of each model across three runs
Summary
Recurrent neural networks (RNNs) are at the core of state-of-the-art approaches for a large number of natural language tasks, including machine translation (Cho et al., 2014; Bahdanau et al., 2015; Jean et al., 2015; Luong et al., 2015), language modeling (Zaremba et al., 2014; Gal and Ghahramani, 2016; Zoph and Le, 2016), opinion mining (Irsoy and Cardie, 2014), and situated language understanding (Mei et al., 2016; Misra et al., 2017; Suhr et al., 2018; Suhr and Artzi, 2018). The difficulty of scaling recurrent networks arises from the time dependence of state computation. In common architectures, such as Long Short-term Memory (LSTM; Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU; Cho et al., 2014), the computation of each step is suspended until the complete execution of the previous step. This sequential dependency makes recurrent networks significantly slower than other operations, and limits their applicability.

SRU replaces the use of convolutions (i.e., n-gram filters), as in QRNN and KNN, with more recurrent connections. This retains modeling capacity while using less computation (and fewer hyper-parameters). We obtain an average improvement of 0.7 BLEU score on the English-to-German translation task by incorporating SRU into the Transformer (Vaswani et al., 2017).
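To make the recurrence concrete, below is a minimal, unoptimized sketch of a single SRU layer in plain numpy, following the light-recurrence and highway equations described in the paper. The function and parameter names (sru_forward, W, W_f, W_r, v_f, v_r, b_f, b_r) are illustrative assumptions; the paper's fused CUDA kernel, scaling correction, and initialization scheme are omitted here.

```python
# Minimal sketch of the SRU light recurrence (single layer, no batching).
# Assumes input dimension equals hidden dimension so the highway connection
# h_t = r_t * c_t + (1 - r_t) * x_t is well-defined.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_forward(x_seq, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """x_seq: (T, d) input sequence; W, W_f, W_r: (d, d); v_f, v_r, b_f, b_r: (d,)."""
    T, d = x_seq.shape
    # The heavy matrix multiplications have no time dependence, so they can be
    # computed for all time steps at once -- the source of SRU's parallelism.
    U  = x_seq @ W.T    # candidate input  W x_t
    Uf = x_seq @ W_f.T  # forget-gate term W_f x_t
    Ur = x_seq @ W_r.T  # reset-gate term  W_r x_t

    c = np.zeros(d)     # internal state c_t
    outputs = []
    for t in range(T):
        c_prev = c
        # Gates use only element-wise (vector) recurrence on c_{t-1}.
        f = sigmoid(Uf[t] + v_f * c_prev + b_f)   # forget gate
        c = f * c_prev + (1.0 - f) * U[t]         # light recurrence on the state
        r = sigmoid(Ur[t] + v_r * c_prev + b_r)   # reset gate
        h = r * c + (1.0 - r) * x_seq[t]          # highway connection to the input
        outputs.append(h)
    return np.stack(outputs), c
```

The per-step work is reduced to element-wise operations, which is what allows the sequential loop to remain cheap while the matrix multiplications are batched across time.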