Abstract

Neural network language models (NNLMs), including feed-forward NNLMs (FNNLMs) and recurrent NNLMs (RNNLMs), have proved to be powerful tools for sequence modeling. One main concern with NNLMs is the heavy computational burden of the output layer, where the output must be probabilistically normalized and the normalizing factors require substantial computation. How to rescore an N-best list or lattice quickly with an NNLM therefore attracts much attention in large-scale applications. In this paper, the statistical characteristics of the normalizing factors are investigated on the N-best list. Based on these observations, we propose to approximate the normalizing factor of each hypothesis as a constant proportional to the number of words in the hypothesis. The unnormalized NNLM is then combined with a back-off N-gram for fast rescoring; it can be evaluated very quickly without normalization in the output layer, reducing the complexity significantly. We apply the proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English Switchboard phone-call speech-to-text task, where both an FNNLM and an RNNLM are trained to demonstrate the method. Experimental results show that the unnormalized probability of the NNLM is quite complementary to that of the back-off N-gram, and that combining the two further reduces the word error rate at little computational cost.
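
To make the approximation concrete, here is a minimal worked sketch of the scoring idea described above (not the paper's exact derivation): a_v(h_t) denotes the pre-softmax output activation for word v given history h_t, V the vocabulary, C the assumed per-word constant, and lambda an illustrative interpolation weight for combining with the back-off N-gram score.

```latex
% Softmax-normalized NNLM probability of word w_t given history h_t
P(w_t \mid h_t) = \frac{\exp\!\big(a_{w_t}(h_t)\big)}{Z(h_t)},
\qquad
Z(h_t) = \sum_{v \in V} \exp\!\big(a_v(h_t)\big)

% Log-probability of a hypothesis W = w_1 \dots w_T; the normalizing
% factor is approximated as a constant C per word
\log P(W) = \sum_{t=1}^{T} \Big[ a_{w_t}(h_t) - \log Z(h_t) \Big]
\;\approx\; \sum_{t=1}^{T} a_{w_t}(h_t) \;-\; T\,C

% Rescoring score: unnormalized NNLM interpolated with the back-off N-gram
S(W) = \lambda \Big( \sum_{t=1}^{T} a_{w_t}(h_t) - T\,C \Big)
       + (1-\lambda)\,\log P_{\mathrm{ngram}}(W)
```

Because Z(h_t) no longer has to be summed over the vocabulary at rescoring time, the output-layer cost per word drops from O(|V|) to O(1), which is the source of the speed-up claimed above.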

Highlights

  • The output of a speech-to-text (STT) system is usually in a multi-candidate form encoded as a lattice or an N-best list

  • It is worth noting that the fast UP-FNNLM (feed-forward NNLM with unnormalized probabilities) is more than 25 times faster than the FNNLM with a class layer and more than 1,100 times faster than the standard FNNLM

  • The language scores of the back-off 3-gram with Kneser–Ney smoothing (KN3) are usually available in the lattice or N-best list, so the UP-RNNLM (recurrent NNLM with unnormalized probabilities) combined with the KN3 reduces the word error rate (WER) by 0.8% and 1.2% absolute on the Hub5'00-SWB and RT03S-FSH sets, respectively

Summary

Introduction

The output of a speech-to-text (STT) system is usually in a multi-candidate form encoded as a lattice or an N-best list. We apply our proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English Switchboard speech-to-text task. Both a feed-forward NNLM and a recurrent NNLM are well trained to verify the effectiveness of our method. As our method is theoretically founded on statistical observations, we first introduce the experimental setup, including the speech recognizer, N-best hypotheses, NNLM structure, and NNLM training, in Section 2 for convenience.

2.1 Speech recognizer and N-best hypotheses

The effectiveness of the proposed method is evaluated on the STT task with the 309-hour Switchboard-I training set [15]. The top 100-best hypotheses are rescored and reranked with other language models, such as a back-off 5-gram, an FNNLM, and an RNNLM, to improve performance.
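
The following is a minimal sketch of how such 100-best rescoring could be organized, assuming each hypothesis already carries an acoustic score and a KN-smoothed 3-gram score taken from the lattice; `Hypothesis`, `unnormalized_nnlm_logit`, `per_word_const`, and the interpolation weights are hypothetical names and values introduced here for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Hypothesis:
    words: Sequence[str]     # word sequence of this N-best entry
    acoustic_score: float    # log acoustic likelihood from the decoder
    kn3_score: float         # log probability from the back-off 3-gram (KN3)


def rescore_nbest(
    hyps: List[Hypothesis],
    unnormalized_nnlm_logit: Callable[[Sequence[str], int], float],
    per_word_const: float,   # constant approximating log Z per word
    lam_am: float = 1.0,     # acoustic weight (illustrative value)
    lam_ngram: float = 0.5,  # KN3 weight (illustrative value)
    lam_nnlm: float = 0.5,   # unnormalized NNLM weight (illustrative value)
) -> Optional[Hypothesis]:
    """Rerank an N-best list with an unnormalized NNLM score.

    The NNLM score of a hypothesis is the sum of its unnormalized output
    activations minus (number of words * per_word_const), i.e. the
    normalizing factor is approximated as a constant per word.
    """
    best, best_score = None, float("-inf")
    for hyp in hyps:
        nnlm_score = sum(
            unnormalized_nnlm_logit(hyp.words, t) for t in range(len(hyp.words))
        ) - len(hyp.words) * per_word_const
        total = (
            lam_am * hyp.acoustic_score
            + lam_ngram * hyp.kn3_score
            + lam_nnlm * nnlm_score
        )
        if total > best_score:
            best, best_score = hyp, total
    return best
```

In practice the interpolation weights and the per-word constant would be tuned on a development set rather than fixed as above.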

Structure and training of NNLM
Statistics of normalizing factors on N-best hypotheses
Normalizing factor for one word
Normalizing factor for one hypothesis
Combining unnormalized NNLM and back-off N-gram
Findings
Conclusions