Abstract

This paper presents a novel deep neural network (DNN) architecture with highway blocks (HWs) that uses a complex discrete Fourier transform (DFT) feature for keyword spotting. In our previous work, we showed that a feed-forward DNN with a time-delayed bottleneck layer (TDB-DNN) trained directly from the audio input outperformed the model with the log-mel filter bank energy (LFBE) feature, given a large amount of training data [1]. However, the deeper structure of such an audio-input DNN makes the optimization problem more difficult, and training can easily become trapped in a local minimum. To alleviate this problem, we propose a new HW network with a time-delayed bottleneck layer (TDB-HW). Our TDB-HW networks can learn a bottleneck feature representation through optimization based on the cross-entropy criterion, without the stage-wise training proposed in [1]. Moreover, we use the complex DFT feature as the pre-processing method. Our experimental results on real data show that the TDB-HW network with the complex DFT feature provides significantly lower miss rates than the LFBE DNN over a range of false alarm rates, yielding approximately 20% relative improvement in the area under the curve (AUC) of the detection error tradeoff (DET) curves for keyword spotting. Furthermore, we investigate the effects of different pre-processing methods on the deep highway network.
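As a rough illustration of the two ingredients named above, the following NumPy sketch computes a complex DFT feature for one audio frame and passes it through a single highway block. It assumes the standard highway formulation y = T(x) * H(x) + (1 - T(x)) * x with a ReLU transform and a sigmoid gate; the frame length, FFT size, weight initialization, gate bias, and the function names `complex_dft_feature` and `highway_block` are hypothetical choices made for this example and are not taken from the paper, which may use a different variant.

```python
import numpy as np

def complex_dft_feature(frame, n_fft=None):
    """Stack the real and imaginary parts of the one-sided DFT of one frame."""
    spectrum = np.fft.rfft(frame, n=n_fft)
    return np.concatenate([spectrum.real, spectrum.imag])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_block(x, w_h, b_h, w_t, b_t):
    """Standard highway layer: y = T(x) * H(x) + (1 - T(x)) * x."""
    h = np.maximum(0.0, x @ w_h + b_h)   # transform H(x), ReLU (assumed)
    t = sigmoid(x @ w_t + b_t)           # transform gate T(x)
    return t * h + (1.0 - t) * x         # carry gate is 1 - T(x)

rng = np.random.default_rng(0)
frame = rng.standard_normal(400)              # hypothetical frame of 400 samples
feat = complex_dft_feature(frame, n_fft=512)  # 257 real + 257 imaginary values
dim = feat.shape[0]

# Hypothetical small random weights; a negative gate bias favors carrying x.
w_h = rng.standard_normal((dim, dim)) * 0.01
w_t = rng.standard_normal((dim, dim)) * 0.01
b_h = np.zeros(dim)
b_t = np.full(dim, -2.0)

out = highway_block(feat, w_h, b_h, w_t, b_t)
```

Because the carry path passes the input through unchanged when the gate is closed, a stack of such blocks can start out close to an identity mapping, which is the usual argument for why highway connections ease the optimization of deeper networks.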
