Abstract

Recurrent neural networks (RNNs) have shown an ability to model temporal dependencies. However, the problem of exploding or vanishing gradients has limited their application. In recent years, long short-term memory RNNs (LSTM RNNs) have been proposed to solve this problem and have achieved excellent results. However, because of their large size, LSTM RNNs suffer more easily from overfitting, especially on low-resource tasks. In addition, because the output of LSTM units is bounded, a vanishing gradient issue often persists across multiple layers. In this work, we evaluate an architecture called the gated recurrent unit (GRU) to address these two problems. Compared with an LSTM RNN, a GRU network is smaller, so it can more easily avoid overfitting. Furthermore, the output of the GRU is not constrained to be bounded, which helps alleviate the negative impact of vanishing gradients across multiple layers. We propose using deep bidirectional GRUs as hybrid acoustic models and compare the similarities and differences between LSTM and GRU. We evaluate this architecture on the CHiME-2 dataset, a robust low-resource speech recognition task. Results demonstrate that our architecture outperforms LSTM, reducing the WER by about 6% relative in the bidirectional case, and, when combined with a baseline GMM system, achieves a 1.1% absolute WER reduction compared to a strong baseline system combination.
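For concreteness, the sketch below shows one time step of a standard GRU in plain NumPy; the gate conventions and any variant actually used in the paper may differ, and the names (gru_step, n_in, n_hid) are illustrative only. It makes visible why a GRU layer has fewer parameters than an LSTM layer: two gates plus one candidate state need three input and three recurrent weight matrices, versus four of each for the LSTM.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    # One GRU time step (standard formulation; the paper's exact variant may
    # differ). p holds input weights W*, recurrent weights U*, and biases b*
    # for the update gate (z), reset gate (r), and candidate state.
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])             # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])             # reset gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
    # Some formulations instead write z * h_prev + (1 - z) * h_cand; the two
    # are equivalent up to relabeling the gate.
    return (1.0 - z) * h_prev + z * h_cand

# Toy usage with illustrative sizes: 40-dim acoustic features, 128 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 40, 128
p = {}
for g in ("z", "r", "h"):
    p["W" + g] = 0.1 * rng.standard_normal((n_hid, n_in))
    p["U" + g] = 0.1 * rng.standard_normal((n_hid, n_hid))
    p["b" + g] = np.zeros(n_hid)
h = np.zeros(n_hid)
for x_t in rng.standard_normal((10, n_in)):  # 10 frames of input features
    h = gru_step(x_t, h, p)

A bidirectional layer, as used in the architecture described above, simply runs one such recurrence forward and a second one backward over each utterance and concatenates the two hidden states at every frame before passing them to the next layer.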
