Abstract

The long short-term memory (LSTM) unit has been widely used in speech recognition, both in acoustic models and language models. For offline speech recognition tasks, the bidirectional LSTM (BLSTM) is the state-of-the-art acoustic model. In this paper, we propose the BLSTM with extended input context (BLSTM-E), which achieves higher speech recognition accuracy than the standard BLSTM. A time delay neural network (TDNN) or an element-wise scale block-sum network (ESBN) is used to extend the input context of the forward and backward LSTMs. Our experiments show that the proposed ESBN-BLSTM-E achieves a 0.9% absolute reduction in word error rate (WER) compared with the standard BLSTM when trained on a 1000-hour Chinese conversational telephone speech (CTS) corpus. Meanwhile, compared with the standard BLSTM, ESBN-BLSTM-E reduces the model parameter size by a relative 22.1%.
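For background, a TDNN layer extends temporal context by splicing each frame with its neighbors at fixed offsets before applying a linear transform. The sketch below shows only the frame-splicing step in NumPy; the offsets and edge-clamping policy are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def splice_frames(feats, offsets=(-2, 0, 2)):
    """TDNN-style frame splicing: for each frame t, concatenate the frames
    at t + offset for every offset, clamping indices at utterance edges.
    The offsets here are an illustrative choice, not the paper's setup."""
    T, D = feats.shape
    spliced = np.empty((T, D * len(offsets)), dtype=feats.dtype)
    for i, off in enumerate(offsets):
        # Clamp out-of-range indices so edge frames reuse the boundary frame.
        idx = np.clip(np.arange(T) + off, 0, T - 1)
        spliced[:, i * D:(i + 1) * D] = feats[idx]
    return spliced

# Example: 10 frames of 4-dimensional features -> 12-dimensional spliced frames
x = np.random.randn(10, 4)
y = splice_frames(x)
print(y.shape)  # (10, 12)
```

The spliced features would then feed the forward and backward LSTMs, giving each direction a wider input context than a single frame.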
