Abstract

Recurrent neural networks (RNNs) have shown an ability to model temporal dependencies. However, the problem of exploding or vanishing gradients has limited their application. In recent years, long short-term memory RNNs (LSTM RNNs) have been proposed to solve this problem and have achieved excellent results. Bidirectional LSTM (BLSTM), which exploits both preceding and following context, has shown particularly good performance. However, the computational requirements of BLSTM approaches are quite heavy, even when implemented efficiently on GPU-based high-performance computers. In addition, because the output of LSTM units is bounded, a vanishing gradient issue often remains across multiple layers. The large size of LSTM networks also makes them susceptible to overfitting. In this work, we combine a local bidirectional architecture, a newer recurrent unit, the gated recurrent unit (GRU), and residual architectures to address these problems. Experiments are conducted on the benchmark datasets released under the IARPA Babel Program. The proposed models achieve 3 to 10% relative improvements over their corresponding DNN or LSTM baselines across seven language collections. In addition, the new models accelerate training by a factor of more than 1.6 compared to conventional BLSTM models. Using these approaches, we achieve good results in the IARPA Babel Program.
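The abstract combines three ideas: GRU recurrent units, bidirectional processing, and residual (skip) connections across layers. The following is a minimal sketch of how such a residual bidirectional GRU stack can be assembled in PyTorch; the class name, layer sizes, and output dimensions are illustrative assumptions, not the authors' actual configuration (the local-window variant is sketched separately after the highlights).

```python
# Minimal sketch, assuming illustrative layer sizes: a stack of bidirectional
# GRU layers with residual connections, in the spirit of the combination the
# abstract describes. Not the authors' implementation.
import torch
import torch.nn as nn


class ResidualBiGRUStack(nn.Module):
    def __init__(self, feat_dim=40, hidden=320, num_layers=4, num_targets=3000):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(num_layers):
            # Each bidirectional GRU layer outputs 2 * hidden features.
            self.layers.append(nn.GRU(in_dim, hidden, batch_first=True,
                                      bidirectional=True))
            in_dim = 2 * hidden
        self.output = nn.Linear(2 * hidden, num_targets)

    def forward(self, x):
        # x: (batch, time, feat_dim) acoustic features.
        for i, gru in enumerate(self.layers):
            out, _ = gru(x)
            # Residual connection once shapes match (from the 2nd layer on),
            # giving gradients a shortcut path across depth.
            x = out + x if i > 0 else out
        return self.output(x)  # per-frame senone logits
```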

Highlights

  • Automatic speech recognition (ASR) has undergone rapid change in recent years

  • The results show that bidirectional long short-term memory (BLSTM) with a local window achieves nearly the same performance as standard BLSTM while significantly reducing training time, confirming the findings of the previous experiments

  • Local-window bidirectional gated recurrent units (LW-BGRU and LW-BrGRU): in these experiments, we briefly evaluate the impact of hyperparameters in LW-BGRU and apply both architectures to all seven languages (a minimal sketch of the local-window idea follows this list)
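As referenced above, here is a minimal sketch of how a local-window bidirectional recurrent layer can be built: the forward GRU keeps unlimited past context, while the backward GRU is restarted on fixed-size chunks so each frame only sees a bounded amount of future context, which is what cuts the training cost. The chunking scheme, window size, and class name LocalWindowBiGRU are illustrative assumptions; the paper's exact LW-BGRU definition may differ.

```python
# Minimal sketch, assuming the local window limits the backward recurrence to
# fixed-size chunks of future frames. Not necessarily the paper's definition.
import torch
import torch.nn as nn


class LocalWindowBiGRU(nn.Module):
    def __init__(self, input_size, hidden, window=20):
        super().__init__()
        self.window = window
        self.fwd = nn.GRU(input_size, hidden, batch_first=True)
        self.bwd = nn.GRU(input_size, hidden, batch_first=True)

    def forward(self, x):
        # x: (batch, time, input_size)
        fwd_out, _ = self.fwd(x)                      # unlimited past context
        chunks = []
        for start in range(0, x.size(1), self.window):
            chunk = x[:, start:start + self.window]   # bounded future context
            rev, _ = self.bwd(torch.flip(chunk, dims=[1]))
            chunks.append(torch.flip(rev, dims=[1]))
        bwd_out = torch.cat(chunks, dim=1)
        return torch.cat([fwd_out, bwd_out], dim=-1)  # (batch, time, 2*hidden)
```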


Summary

Introduction

Automatic speech recognition (ASR) has undergone rapid change in recent years. Deep neural networks (DNNs) combined with hidden Markov models (HMMs) have become the dominant approach for acoustic modeling [1, 2], replacing the traditional Gaussian mixture model-hidden Markov model (GMM-HMM) approach. LSTM units use gates to control information flow and effectively create shortcut paths across multiple temporal steps. This gating mechanism makes LSTM architectures well suited to sequence tasks and has improved robustness [13, 14, 15]. RNNs are a neural network framework in which activations from the previous time step are fed back as inputs. This structure allows the network to capture a dynamic history of the input feature sequence and is less affected by temporal distortion. Due to these properties, RNNs have performed better than traditional DNNs in large-vocabulary speech recognition tasks. We use the term LSTM to denote such a deep LSTM-projected architecture and use this approach as our baseline.
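For concreteness, here is a minimal sketch of a deep LSTM-projected (LSTMP) baseline like the one the introduction refers to, using PyTorch's proj_size option to add a recurrent projection after each LSTM layer. The layer count, cell and projection sizes, and class name DeepLSTMP are assumptions for illustration, not the paper's exact baseline configuration.

```python
# Minimal sketch, assuming illustrative sizes: a unidirectional deep
# LSTM-projected (LSTMP) acoustic model. Not the paper's exact baseline.
import torch
import torch.nn as nn


class DeepLSTMP(nn.Module):
    def __init__(self, feat_dim=40, cell=1024, proj=512, num_layers=3,
                 num_targets=3000):
        super().__init__()
        # proj_size adds a recurrent projection layer after each LSTM layer,
        # shrinking the recurrent state from `cell` to `proj` dimensions.
        self.lstm = nn.LSTM(feat_dim, cell, num_layers=num_layers,
                            proj_size=proj, batch_first=True)
        self.output = nn.Linear(proj, num_targets)

    def forward(self, x):
        # x: (batch, time, feat_dim); unidirectional, so only past context.
        out, _ = self.lstm(x)
        return self.output(out)  # per-frame senone logits
```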

Advanced recurrent architectures and training algorithms
Bidirectional LSTM
Experimental results and discussion
LW-BLSTM-MBN
Conclusions