Abstract

The performance of long short-term memory (LSTM) recurrent neural network (RNN)-based language models has improved steadily on language modeling benchmarks. Although recurrent layers are widely used, previous studies have shown that an LSTM RNN-based language model (LM) cannot overcome the limitation of context length. To train LMs on longer sequences, attention mechanism-based models have recently been adopted. In this paper, we propose an LM that uses a neural Turing machine (NTM) architecture based on localized content-based addressing (LCA). The NTM architecture is an attention-based model; however, its content-based addressing is costly because every memory address must be accessed to compute the cosine similarities. To address this problem, we propose the LCA method. LCA first finds the maximum among the cosine similarities computed over all memory addresses, and then normalizes a local memory area around the selected address with the softmax function. The LCA method is applied to a pre-trained NTM-based LM during the test stage. The proposed architecture is evaluated on the Penn Treebank and enwik8 LM tasks, and the experimental results indicate that it outperforms the previous NTM architecture.
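As a rough sketch of the addressing step described above (the function name, the fixed window size, and the centered window placement are illustrative assumptions, not details taken from the paper), the following code computes cosine similarities against every memory row, selects the best-matching address, and applies the softmax only to a local region around it:

```python
import numpy as np

def localized_content_addressing(memory, key, beta=1.0, window=8):
    """Illustrative LCA weighting: softmax over a local window around the
    address with the highest cosine similarity to the key.

    memory : (N, D) array of memory rows
    key    : (D,) query key produced by the controller
    beta   : key strength, as in standard NTM content addressing
    window : assumed size of the local region that is normalized
    """
    eps = 1e-8
    # Cosine similarity between the key and every memory address.
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    )

    # Address with the maximum similarity.
    center = int(np.argmax(sims))

    # Softmax restricted to a window around the selected address;
    # all other addresses receive zero weight.
    lo = max(0, center - window // 2)
    hi = min(len(sims), lo + window)
    local = np.exp(beta * sims[lo:hi] - np.max(beta * sims[lo:hi]))
    weights = np.zeros_like(sims)
    weights[lo:hi] = local / local.sum()
    return weights
```

Calling this with, for example, a 128 x 20 memory matrix and a 20-dimensional key yields a weight vector that is non-zero only inside the selected window, so the head attends to a small neighborhood rather than the whole memory.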

Highlights

  • A language model (LM) estimates the probability of the current word based on the previous word sequence

  • The hyper-parameters of the trellis and AWD-LSTM networks were the same as those used in previous studies

  • We evaluated the performance of the MDM-NTM architecture with respect to the weight decay


Summary

Introduction

A language model (LM) estimates the probability of the current word given the preceding word sequence. For a word sequence W = (w1, w2, ..., wN), the LM assigns the probability P(W). As the word history (w1, w2, ..., wi−1) grows, estimating the probability of the current word wi becomes increasingly difficult, because long histories rarely appear in the training corpus. For this reason, the Markov assumption is applied when computing P(wi | w1, ..., wi−1), so that the length of the word history affecting wi is limited to (n − 1) words.
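For reference, the standard chain-rule factorization and its n-gram (Markov) approximation, a textbook formulation rather than one quoted from this paper, can be written as:

```latex
P(W) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})
     \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```

so that each word is conditioned only on its (n − 1) predecessors.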
