Abstract

In this paper, we exploit long-term information from multiple time-spans for automatic speech recognition. The multiple time-span information is encoded into three different feature streams: speaker-adaptation-transformed features, deep bottleneck features, and deep hierarchical bottleneck features. By combining the three time-spans in discriminative acoustic modeling, the character/syllable error rate improves for Mandarin and Vietnamese conversational telephone speech recognition: we obtain absolute improvements of 0.8% in character error rate for Mandarin and 1.9% in syllable error rate for Vietnamese over DNN-HMM baselines. Further analysis also suggests that the proposed feature-fusion approach encodes finer-grained temporal information than directly feeding long time-span input features to DNN-HMM baselines.
