Abstract

This paper presents a study of using a deep bidirectional long short-term memory (DBLSTM) recurrent neural network as the acoustic model for DBLSTM-HMM based large vocabulary continuous speech recognition (LVCSR). A context-sensitive-chunk (CSC) back-propagation through time (BPTT) approach is used to train the DBLSTM by splitting each training sequence into chunks with appended contextual observations, and a CSC-based decoding method with possibly overlapping CSCs is used for recognition. Our approach makes mini-batch based training on a GPU more efficient and reduces the latency of DBLSTM-based LVCSR from a whole utterance to a short chunk. Evaluations have been made on the Switchboard-I benchmark task. In comparison with epoch-wise BPTT training, our method achieves more than a threefold speedup on a single GPU card without degrading recognition accuracy. In comparison with a highly optimized DNN-HMM system trained with a frame-level cross entropy (CE) criterion, our CE-trained DBLSTM-HMM system achieves relative word error rate reductions of 9% and 5% on the Eval2000 and RT03S test sets, respectively. Furthermore, when running model-averaging based parallel training of the DBLSTM on a cluster of GPUs, CSC-BPTT incurs less accuracy degradation than epoch-wise BPTT while achieving a linear speedup.
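To illustrate the chunking idea behind CSC-BPTT, the following Python sketch splits an utterance's feature matrix into fixed-size chunks with appended left and right contextual frames. This is a minimal sketch of the general technique, not the paper's exact recipe; the function name and the chunk/context sizes are hypothetical choices for illustration.

```python
import numpy as np

def split_into_csc(features, chunk_size=64, left_context=16, right_context=16):
    """Split one utterance (a T x D feature matrix) into context-sensitive
    chunks. Each chunk is padded with neighbouring frames as left/right
    context so a bidirectional network can still use information from
    outside the chunk. Sizes here are illustrative, not the paper's settings.
    """
    T = len(features)
    chunks = []
    for start in range(0, T, chunk_size):
        end = min(start + chunk_size, T)
        ctx_start = max(0, start - left_context)   # truncated at utterance start
        ctx_end = min(T, end + right_context)      # truncated at utterance end
        chunks.append({
            "data": features[ctx_start:ctx_end],
            # Only the central frames carry training targets (or emit
            # posteriors at decoding time); context frames are discarded.
            "valid": slice(start - ctx_start, end - ctx_start),
        })
    return chunks

# Usage: chunk a 300-frame utterance of 40-dim features.
utt = np.random.randn(300, 40)
for c in split_into_csc(utt):
    central = c["data"][c["valid"]]  # frames whose outputs are kept
```

Because every chunk has the same bounded length, chunks from different utterances can be batched together for efficient GPU mini-batch training, and at decoding time the latency is bounded by the chunk length plus the right context rather than the utterance length.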
