Abstract

Recurrent neural networks (RNNs) have recently been applied as classifiers for sequential labeling problems. In this paper, deep bidirectional RNNs (DBRNNs) are applied to error detection in automatic speech recognition (ASR), which is a sequential labeling problem. We investigate three ASR error detection tasks: confidence estimation, out-of-vocabulary word detection, and error type classification. We also estimate ASR accuracy, i.e. percent correct and word accuracy, from the error type classification results. Experimental results on English and Japanese lecture speech corpora show that the DBRNNs greatly outperform conditional random fields (CRFs) and the other NN structures, i.e. deep feedforward NNs (DNNs) and deep unidirectional RNNs (DURNNs). These improvements arise because the DBRNNs can take a longer bidirectional context of the input feature vectors into account and can model highly nonlinear relationships between the input feature vectors and the output labels. In detailed analyses, the DBRNNs show better generalization ability than the CRFs, owing to their ability to represent (embed) words in a low-dimensional continuous vector space. In addition, the superiority of the DBRNNs over the DNNs and DURNNs indicates that the average context length of the input feature vectors required for ASR error detection is only a few time steps, although it can lengthen depending on the situation.
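To illustrate the key property the abstract relies on (a bidirectional recurrence that gives each time step access to both past and future context before a label is assigned), the following is a minimal NumPy sketch of a single-layer bidirectional RNN labeler. All dimensions, weights, and the tanh/softmax choices here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Illustrative sketch: single-layer bidirectional RNN for sequence labeling.
# Dimensions and random weights are hypothetical, chosen only for the demo.
rng = np.random.default_rng(0)
D, H, K, T = 8, 16, 3, 5   # input dim, hidden dim, number of labels, length

# Hypothetical parameters: forward cell (Wf, Uf), backward cell (Wb, Ub),
# and an output layer V that reads the concatenated hidden states.
Wf, Uf = rng.normal(size=(H, D)) * 0.1, rng.normal(size=(H, H)) * 0.1
Wb, Ub = rng.normal(size=(H, D)) * 0.1, rng.normal(size=(H, H)) * 0.1
V = rng.normal(size=(K, 2 * H)) * 0.1

def bidirectional_rnn_labels(x):
    """x: (T, D) sequence of feature vectors -> (T, K) label posteriors."""
    steps = x.shape[0]
    hf = np.zeros((steps, H))  # forward states: summarize past context
    hb = np.zeros((steps, H))  # backward states: summarize future context
    for t in range(steps):                 # left-to-right pass
        prev = hf[t - 1] if t > 0 else np.zeros(H)
        hf[t] = np.tanh(Wf @ x[t] + Uf @ prev)
    for t in reversed(range(steps)):       # right-to-left pass
        nxt = hb[t + 1] if t < steps - 1 else np.zeros(H)
        hb[t] = np.tanh(Wb @ x[t] + Ub @ nxt)
    # Each label decision sees both directions of context.
    logits = np.concatenate([hf, hb], axis=1) @ V.T
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

probs = bidirectional_rnn_labels(rng.normal(size=(T, D)))
```

A unidirectional RNN would omit the right-to-left pass, which is exactly the context the abstract argues matters for error detection; in practice the recurrence would use trained weights and gated cells rather than the random tanh cells shown here.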
