Abstract

Building deep neural network acoustic models directly on far-field speech from multiple recording environments with different acoustic properties is an increasingly popular approach to the problem of distant speech recognition. The common approach to building such multi-condition (multi-domain) models is to compile the available data from all environments into a single training set, discarding information about the specific environment to which each utterance belongs. We propose a novel strategy for training neural network acoustic models based on adversarial training, which makes use of environment labels during training. By adjusting the parameters of the initial layers of the network adversarially with respect to a domain classifier trained to recognize the recording environments, we enforce better invariance to the diversity of recording conditions. We provide a motivating study on the mechanism by which a deep network learns environmental invariance, and discuss relations with existing approaches for improving the robustness of DNN models. The proposed multi-domain adversarial training is evaluated on an end-to-end speech recognition task based on the AMI meeting corpus, achieving a relative character error rate reduction of 3.3% with respect to a conventional multi-condition trained baseline and 25.4% with respect to a clean-trained baseline.
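To make the adversarial mechanism concrete, below is a minimal sketch of domain-adversarial training via a gradient reversal layer, the standard way to train initial layers adversarially against a domain classifier. This is an illustration under assumed details, not the paper's actual implementation: the module names, layer sizes, and the `lam` weighting are hypothetical.

```python
# Sketch of multi-domain adversarial training with a gradient reversal
# layer (PyTorch). All names and hyperparameters here are illustrative
# assumptions, not taken from the paper.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lam on the way back."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed (negated) gradient flows into the encoder; no gradient for lam.
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)


class AdversarialAcousticModel(nn.Module):
    def __init__(self, feat_dim, hidden, n_chars, n_domains):
        super().__init__()
        # Initial layers whose parameters are adjusted adversarially.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # Main recognition head (e.g., per-frame character posteriors).
        self.asr_head = nn.Linear(hidden, n_chars)
        # Domain classifier over recording environments.
        self.domain_head = nn.Linear(hidden, n_domains)

    def forward(self, feats, lam=1.0):
        h = self.encoder(feats)
        asr_logits = self.asr_head(h)
        # The reversal makes the encoder *remove* environment information:
        # the domain head is trained to classify environments, while the
        # reversed gradient pushes the encoder toward domain invariance.
        dom_logits = self.domain_head(grad_reverse(h, lam))
        return asr_logits, dom_logits
```

In training, one would minimize the sum of the recognition loss and the domain-classification loss; because of the reversal, the domain loss updates the classifier to recognize environments while simultaneously updating the encoder to confuse it, enforcing invariance to recording conditions.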
