Abstract

In this paper, we exploit long-term information from multiple time-spans for automatic speech recognition. The multiple time-span information is encoded into three different feature streams: speaker-adaptation-transformed features, deep bottleneck features, and deep hierarchical bottleneck features. By combining the three time-spans in discriminative acoustic modeling, the character/syllable error rate improves for Mandarin and Vietnamese conversational telephone speech recognition: we obtain absolute improvements of 0.8% in character error rate for Mandarin and 1.9% in syllable error rate for Vietnamese over DNN-HMM baselines. Further analysis also suggests that the proposed feature-fusion approach encodes finer-grained temporal information than directly feeding long time-span input features to DNN-HMM baselines.
