Abstract

Speech dereverberation based on deep learning has recently achieved remarkable success, substantially improving recognition accuracy in distant speech recognition tasks. However, environmental mismatches due to noise and reverberation may degrade performance when features (e.g. MFCCs) are fed directly into a speech recognition system without feature enhancement. To address this problem, we propose a new speech dereverberation approach based on deep convolution and self-attention mechanisms to enhance MFCC-based features extracted from distant signals. The deep convolutional component efficiently exploits frequency-temporal context patterns, while the multi-head self-attention mechanism captures complete time-domain cues to enhance the temporal context. Meanwhile, bottleneck features trained on a clean corpus are utilized as teacher signals, because they contain cues relevant to phoneme classification, and the mapping is performed with the objective of suppressing noise and reverberation. Extensive experimental results on the REVERB challenge corpus demonstrate that our proposed approach outperforms all competing methods, reducing the word error rate (WER) by about 17% relative to the deep neural network (DNN) baseline.
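To make the self-attention component concrete, the following is a minimal NumPy sketch of multi-head self-attention applied over the time axis of an MFCC-like feature sequence, where each frame attends to all frames in the utterance. This is an illustrative sketch under assumed shapes and randomly initialized projections, not the authors' implementation; the function name, matrix names, and dimensions (100 frames, 40-dim features, 4 heads) are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention over the time axis of a feature sequence.

    X: (T, d) sequence of T frames with d-dimensional features.
    Wq, Wk, Wv, Wo: (d, d) projection matrices (hypothetical, random here).
    """
    T, d = X.shape
    dh = d // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Split projections into heads: (num_heads, T, dh)
    def split(M):
        return M.reshape(T, num_heads, dh).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Scaled dot-product attention: every frame attends to every frame,
    # which is how self-attention gathers utterance-wide temporal cues.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)   # (h, T, T)
    attn = softmax(scores, axis=-1)
    heads = attn @ Vh                                    # (h, T, dh)

    # Concatenate heads back to (T, d) and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(T, d)
    return concat @ Wo

rng = np.random.default_rng(0)
T, d, h = 100, 40, 4          # illustrative sizes: 100 frames, 40-dim MFCC-like features
X = rng.standard_normal((T, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads=h)
print(Y.shape)  # (100, 40): one enhanced feature vector per input frame
```

In the proposed system such an attention layer would sit after the convolutional layers, and the network output would be trained to match the clean-corpus bottleneck features rather than returned directly.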
