Abstract
We propose a novel approach that uses a recently introduced fully convolutional network (FCN) architecture, UNet++, for multichannel speech dereverberation and distant speech recognition (DSR). While the earlier FCN architecture UNet already exploits the time-frequency structure of speech well, UNet++ is more robust with respect to network depth and skip-connection design. For DSR, UNet++ serves as a feature enhancement front-end, and the enhanced speech features are used for acoustic model training and recognition. We also propose a frequency-dependent convolution scheme (FDCS), yielding new variants of UNet and UNet++. We present DSR results on the multiple distant microphone (MDM) datasets of the AMI meeting corpus, and compare the performance of UNet++ with UNet and weighted prediction error (WPE). Our results demonstrate that, for DSR, the UNet++-based approaches provide large word error rate (WER) reductions over their UNet- and WPE-based counterparts. UNet++ with WPE preprocessing and 4-channel input achieves the lowest WERs. Dereverberation quality is also measured by the speech-to-reverberation modulation energy ratio (SRMR), which likewise shows large gains of UNet++ over UNet and WPE.
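The abstract does not detail the frequency-dependent convolution scheme (FDCS). The following is a minimal PyTorch sketch of one plausible reading, in which the frequency axis is split into bands and each band receives its own (non-weight-shared) 2-D convolution; the class name, band count, and tensor layout are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class FrequencyDependentConv(nn.Module):
    """Illustrative frequency-dependent convolution: the frequency axis is
    split into bands, and each band is filtered by its own 2-D convolution,
    so kernel weights are not shared across frequency."""

    def __init__(self, in_channels, out_channels, num_bands=4, kernel_size=3):
        super().__init__()
        self.num_bands = num_bands
        self.band_convs = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, kernel_size,
                      padding=kernel_size // 2)
            for _ in range(num_bands)
        )

    def forward(self, x):
        # x: (batch, channels, time, freq)
        bands = torch.chunk(x, self.num_bands, dim=-1)   # split along frequency
        out = [conv(band) for conv, band in zip(self.band_convs, bands)]
        return torch.cat(out, dim=-1)                    # reassemble the frequency axis


if __name__ == "__main__":
    # Toy 4-channel (4-microphone) input: batch of 2, 100 frames, 80 frequency bins.
    x = torch.randn(2, 4, 100, 80)
    layer = FrequencyDependentConv(in_channels=4, out_channels=16, num_bands=4)
    print(layer(x).shape)  # torch.Size([2, 16, 100, 80])
```

In this sketch, a layer of this kind could replace the standard weight-shared convolutions inside a UNet or UNet++ encoder/decoder block to obtain the FDCS variants mentioned above.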