Abstract
Automatic speech recognition (ASR) is one of the most demanding tasks in natural language processing because of its complexity. Recently, deep learning approaches have been applied to this task and have been shown to outperform traditional machine learning approaches such as artificial neural networks (ANNs). In particular, deep learning methods such as long short-term memory (LSTM) have improved ASR performance. However, this method is limited when processing continuous input streams. A traditional LSTM requires four linear (multilayer perceptron, MLP) layers per cell, with a large memory bandwidth for each sequence time step, and it cannot accommodate the many computational units required for processing continuous input streams because the system lacks sufficient memory bandwidth to feed those units. In this study, an enhanced deep learning LSTM recurrent neural network (RNN) model is proposed to resolve this shortcoming. In the proposed model, a "forget gate" is incorporated into the memory block to allow cell states to be reset at the beginning of sub-sequences, which enables the system to process continuous input streams efficiently without increasing the required bandwidth. The standard LSTM architecture is also modified to use the model parameters more effectively. Several CNN-based and sequential models were trained on the same dataset and compared with the proposed model. The LSTM-RNN outperformed the other deep learning models, achieving an accuracy of 99.36% on a well-established public benchmark dataset of spoken English digits.
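To make the described mechanism concrete, the sketch below shows a minimal LSTM memory cell with a forget gate and an explicit cell-state reset at sub-sequence boundaries. This is an illustrative NumPy sketch under our own assumptions, not the authors' implementation; the class name, the `reset` flag, and the weight shapes are hypothetical, and it only highlights the four linear (MLP) maps per cell and the role of the forget gate in clearing old state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCellWithForgetGate:
    """Minimal LSTM cell with a forget gate (illustrative sketch only)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        # Four linear (MLP) maps per cell: input, forget, output gates and candidate state.
        shape = (hidden_size, input_size + hidden_size)
        self.W_i = rng.normal(0, 0.1, shape); self.b_i = np.zeros(hidden_size)
        self.W_f = rng.normal(0, 0.1, shape); self.b_f = np.zeros(hidden_size)
        self.W_o = rng.normal(0, 0.1, shape); self.b_o = np.zeros(hidden_size)
        self.W_c = rng.normal(0, 0.1, shape); self.b_c = np.zeros(hidden_size)

    def step(self, x, h_prev, c_prev, reset=False):
        # Reset the cell state at the beginning of a new sub-sequence,
        # so a continuous input stream does not accumulate stale state.
        if reset:
            h_prev = np.zeros_like(h_prev)
            c_prev = np.zeros_like(c_prev)
        z = np.concatenate([x, h_prev])
        i = sigmoid(self.W_i @ z + self.b_i)        # input gate
        f = sigmoid(self.W_f @ z + self.b_f)        # forget gate
        o = sigmoid(self.W_o @ z + self.b_o)        # output gate
        c_tilde = np.tanh(self.W_c @ z + self.b_c)  # candidate cell state
        c = f * c_prev + i * c_tilde                # forget gate scales the old state
        h = o * np.tanh(c)
        return h, c
```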