Abstract

This paper describes a new deep neural network (DNN) architecture for acoustic models. Training DNNs on raw speech signals provides 1) novel signal features, 2) processing that requires no normalization such as utterance-wise mean subtraction, and 3) low-latency speech recognition for robot audition. Exploiting a longer context of the raw speech signal seems useful for improving recognition accuracy. However, naive use of longer contexts loses short-term patterns, and recognition accuracy degrades as a result. We propose a multi-timescale feature-extraction architecture for DNNs, built from blocks with different time scales, which enables capturing both long- and short-term patterns of speech signals. Each block consists of complex-valued networks that correspond to Fourier and filterbank transformations for analysis. Experiments showed that the proposed multi-timescale architecture reduced the word error rate by about 3% compared with architectures that use only the long-term context. Analysis of the extracted features revealed that our architecture efficiently captured both the slow and fast changes of speech features.
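The multi-timescale idea can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the blocks here use fixed (untrained) weights, whereas the abstract describes trainable complex-valued networks. The window lengths, the 16 kHz sampling rate, the magnitude nonlinearity, the linearly spaced triangular filters, and all function names are assumptions made for illustration only.

```python
import numpy as np

def dft_matrix(n):
    """Complex DFT matrix of shape (n//2 + 1, n): a fixed complex-valued linear layer."""
    k = np.arange(n // 2 + 1)[:, None]
    t = np.arange(n)[None, :]
    return np.exp(-2j * np.pi * k * t / n)

def triangular_filterbank(n_bins, n_filters):
    """Linearly spaced triangular filters over frequency bins (illustrative, not mel)."""
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    bins = np.arange(n_bins)[None, :]
    left, mid, right = centers[:-2, None], centers[1:-1, None], centers[2:, None]
    up = (bins - left) / (mid - left)
    down = (right - bins) / (right - mid)
    return np.clip(np.minimum(up, down), 0.0, None)

def block_features(segment, n_filters):
    """One time-scale block: window -> complex DFT -> magnitude -> filterbank -> log."""
    n = len(segment)
    spectrum = dft_matrix(n) @ (segment * np.hanning(n))
    fbank = triangular_filterbank(n // 2 + 1, n_filters)
    return np.log(fbank @ np.abs(spectrum) + 1e-8)

def multi_timescale_features(context, window_lengths=(400, 1600), n_filters=40):
    """Concatenate block outputs taken from the most recent samples at each time scale."""
    return np.concatenate(
        [block_features(context[-w:], n_filters) for w in window_lengths]
    )

rng = np.random.default_rng(0)
context = rng.standard_normal(1600)  # 100 ms of audio at an assumed 16 kHz
features = multi_timescale_features(context)
print(features.shape)
```

The short window keeps fast changes that a single long window would smear, while the long window supplies the wider context; in a trained model, the DFT and filterbank matrices would serve as initial or structural counterparts of learnable complex-valued layers.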
