Abstract

Building deep neural network acoustic models directly on far-field speech from multiple recording environments with different acoustic properties is an increasingly popular approach to the problem of distant speech recognition. The common approach to building such multi-condition (multi-domain) models is to compile the available data from all environments into a single training set, discarding information about the specific environment to which each utterance belongs. We propose a novel strategy for training neural network acoustic models based on adversarial training, which makes use of environment labels during training. By adjusting the parameters of the initial layers of the network adversarially with respect to a domain classifier trained to recognize the recording environments, we enforce better invariance to the diversity of recording conditions. We provide a motivating study on the mechanism by which a deep network learns environmental invariance, and discuss relations with existing approaches for improving the robustness of DNN models. The proposed multi-domain adversarial training is evaluated on an end-to-end speech recognition task based on the AMI meeting corpus, achieving a relative character error rate reduction of 3.3% with respect to a conventional multi-condition trained baseline and 25.4% with respect to a clean-trained baseline.
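To make the adversarial mechanism concrete, below is a minimal sketch of domain-adversarial training via a gradient reversal layer, the standard way to train initial layers adversarially against a domain classifier. This is an illustration under assumed details, not the paper's actual implementation: the module names, layer sizes, and the `lam` weighting are hypothetical.

```python
# Sketch of multi-domain adversarial training with a gradient reversal
# layer (PyTorch). All names and hyperparameters here are illustrative
# assumptions, not taken from the paper.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lam on the way back."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed (negated) gradient flows into the encoder; no gradient for lam.
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)


class AdversarialAcousticModel(nn.Module):
    def __init__(self, feat_dim, hidden, n_chars, n_domains):
        super().__init__()
        # Initial layers whose parameters are adjusted adversarially.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # Main recognition head (e.g., per-frame character posteriors).
        self.asr_head = nn.Linear(hidden, n_chars)
        # Domain classifier over recording environments.
        self.domain_head = nn.Linear(hidden, n_domains)

    def forward(self, feats, lam=1.0):
        h = self.encoder(feats)
        asr_logits = self.asr_head(h)
        # The reversal makes the encoder *remove* environment information:
        # the domain head is trained to classify environments, while the
        # reversed gradient pushes the encoder toward domain invariance.
        dom_logits = self.domain_head(grad_reverse(h, lam))
        return asr_logits, dom_logits
```

In training, one would minimize the sum of the recognition loss and the domain-classification loss; because of the reversal, the domain loss updates the classifier to recognize environments while simultaneously updating the encoder to confuse it, enforcing invariance to recording conditions.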
