Abstract

This paper addresses the robust beamforming problem for speech recognition using a novel time-frequency mask estimator. The beamformer first estimates a time-frequency mask with a deep neural network (DNN), from which the covariance matrices of the target speech and noise are computed; the beamformer coefficients are then obtained directly via generalized eigenvector decomposition. To achieve accurate covariance matrix estimation for robust beamforming, we propose a DNN-based mask estimator that exploits the spatial features of the multi-channel microphone signals. The proposed estimator leverages the spatial information of the microphone array by using the multi-channel signals to estimate a speech-aware mask and a noise-aware mask simultaneously. With these target-specific masks, accurate covariance matrices of the target speech and the noise can be obtained independently from the observation. Experiments on the CHiME4 data set demonstrate that, compared with the baseline toolkit (BeamformIt) and the winning system of the CHiME3 challenge, the proposed method achieves better results in terms of both perceptual speech quality and speech recognition error rate.
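
As a minimal sketch of the pipeline the abstract describes, the snippet below computes mask-weighted spatial covariance matrices and takes the beamformer weights as the principal generalized eigenvector in each frequency bin. The NumPy/SciPy implementation, array shapes, and names (stft, speech_mask, noise_mask) are illustrative assumptions, not the authors' code; the DNN mask estimator itself is not reproduced, its outputs are simply taken as given.

```python
# Illustrative sketch of mask-based GEV beamforming (assumptions noted above).
import numpy as np
from scipy.linalg import eigh

def gev_beamformer(stft, speech_mask, noise_mask):
    """Per-frequency GEV beamformer weights.

    stft:        (F, T, M) complex multi-channel STFT of the observation
    speech_mask: (F, T) speech-aware mask in [0, 1]
    noise_mask:  (F, T) noise-aware mask in [0, 1]
    returns:     (F, M) complex beamformer weights
    """
    F, T, M = stft.shape
    weights = np.zeros((F, M), dtype=complex)
    for f in range(F):
        Y = stft[f]  # (T, M)
        # Mask-weighted spatial covariances: Phi = sum_t m(t) y(t) y(t)^H.
        phi_s = np.einsum('t,tm,tn->mn', speech_mask[f], Y, Y.conj())
        phi_s /= max(speech_mask[f].sum(), 1e-10)
        phi_n = np.einsum('t,tm,tn->mn', noise_mask[f], Y, Y.conj())
        phi_n /= max(noise_mask[f].sum(), 1e-10)
        phi_n += 1e-10 * np.eye(M)  # small diagonal load for invertibility
        # Principal generalized eigenvector of (phi_s, phi_n): the GEV
        # criterion, which maximizes the output SNR in each bin.
        _, vecs = eigh(phi_s, phi_n)
        weights[f] = vecs[:, -1]  # eigenvector of the largest eigenvalue
    return weights

def apply_beamformer(stft, weights):
    """Apply the weights: returns the (F, T) single-channel output STFT."""
    return np.einsum('fm,ftm->ft', weights.conj(), stft)
```

Note that GEV weights are defined only up to a per-frequency scaling, so practical systems usually add a normalization postfilter to the beamformer output; that detail is omitted in this sketch.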
