Abstract
Recurrent neural networks (RNNs) are an efficient way to train language models, and various RNN architectures have been proposed to improve performance. However, as network scale increases, overfitting becomes a more pressing problem. In this paper, we propose a framework, G2Basy, to speed up the training process and ease the overfitting problem. Instead of using predefined hyperparameters, we devise a gradient increasing and decreasing technique that changes two hyperparameters, the training batch size and the input dropout, simultaneously by a user-defined step size. Together with a pretrained word embedding initialization procedure and the introduction of different optimizers at different learning rates, our framework speeds up the training process dramatically and improves performance compared with a benchmark model of the same scale. For the word embedding initialization, we propose the concept of “artificial features” to describe the characteristics of the obtained word embeddings. We experiment on two of the most often used corpora, the Penn Treebank and WikiText-2 datasets; on both, we outperform the benchmark results and show potential for further improvement. Furthermore, our framework shows better results with the larger and more complicated WikiText-2 corpus than with the Penn Treebank. Our results are comparable to other state-of-the-art results, with network scales hundreds of times smaller and fewer training epochs.
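The idea of stepping batch size and input dropout together can be sketched as follows. This is only an illustration under stated assumptions: the class name, the field names, the clamping bounds, and the choice to move both values in the same direction are hypothetical and are not taken from the paper, which only states that both hyperparameters change simultaneously by a user-defined step size.

```python
from dataclasses import dataclass


@dataclass
class StepSchedule:
    """Hypothetical holder for the two hyperparameters that move together."""
    batch_size: int
    input_dropout: float
    batch_step: int          # assumed user-defined step for the batch size
    dropout_step: float      # assumed user-defined step for the input dropout
    min_batch: int = 1       # assumed lower bound on the batch size
    max_dropout: float = 0.9  # assumed upper bound on the dropout rate

    def step(self, direction: int) -> None:
        """Move both hyperparameters by one step; direction is +1 or -1."""
        if direction not in (1, -1):
            raise ValueError("direction must be +1 or -1")
        # Assumption: both values move in the same direction on each step.
        self.batch_size = max(
            self.min_batch, self.batch_size + direction * self.batch_step
        )
        new_dropout = self.input_dropout + direction * self.dropout_step
        # Keep the dropout rate inside [0, max_dropout].
        self.input_dropout = min(self.max_dropout, max(0.0, new_dropout))


sched = StepSchedule(batch_size=20, input_dropout=0.4,
                     batch_step=5, dropout_step=0.05)
sched.step(+1)  # one increasing step applied to both values
```

When each step is triggered (for example, after a fixed number of epochs or when validation perplexity stalls) is left open here, since the abstract does not specify it.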
Highlights
Natural language processing (NLP) is the area of artificial intelligence that concerns the automatic generation and understanding of human languages [1]
Language models are an essential part of NLP that can predict upcoming words based on a given context [2]
ASGD still works if we introduce it at a learning rate of 0.3125, but the training begins to overfit after a few epochs
Summary
Natural language processing (NLP) is the area of artificial intelligence that concerns the automatic generation and understanding of human languages [1]. To alleviate the overfitting problem and enhance the generalization ability of language models, mechanisms like tied weights [12], dropout [13], and a vast variety of optimization algorithms, such as Momentum [14], Adadelta [15], and Adam [16], have been proposed. These techniques do not work well on RNNs, especially on LSTM networks [17], which are designed to solve long time lag tasks. Our framework uses pretrained GloVe word embeddings to initialize its input vectors and changes optimization algorithms during training. Compared with other state-of-the-art regularized multilayer RNN models with much larger scales, our framework still achieves close results.
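Changing the optimization algorithm at a given learning rate can be illustrated with a small sketch. Only the ASGD optimizer and the learning rate of 0.3125 come from the text above; the function names, the use of plain SGD as the initial optimizer, the starting learning rate of 20, and the annealing factor of 4 are assumptions chosen for illustration.

```python
def pick_optimizer(lr: float, switch_lr: float = 0.3125) -> str:
    """Return the name of the optimizer to use at this learning rate.

    Assumption: training starts with plain SGD and ASGD is introduced
    once the learning rate has been annealed down to `switch_lr`.
    """
    return "ASGD" if lr <= switch_lr else "SGD"


def anneal_schedule(start_lr: float, factor: float, steps: int):
    """Yield (learning_rate, optimizer_name) for the start value and
    for each value obtained by dividing the rate by `factor`."""
    lr = start_lr
    for _ in range(steps + 1):
        yield lr, pick_optimizer(lr)
        lr /= factor


# Hypothetical run: start at 20 and divide by 4 three times.
schedule = list(anneal_schedule(start_lr=20.0, factor=4.0, steps=3))
for lr, name in schedule:
    print(f"lr={lr:<7} optimizer={name}")
```

In a real training loop the returned name would select an actual optimizer object; the string is used here only to keep the sketch free of framework dependencies.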