Abstract
Spoken accents severely degrade the performance of automatic speech recognition (ASR) systems. Domain adversarial training (DAT) is widely adopted to generate domain-invariant features that reduce the influence of accents. However, the features produced by DAT still retain some accent-discriminative information, which limits ASR performance. Moreover, the amount of residual accent-discriminative information in the DAT features differs from accent to accent. In this paper, we propose an adaptive attention network combined with DAT to further eliminate the residual accent information in the features generated by DAT. We employ the adaptive attention module to transform the encoder output into a more general representation. Experiments on the AESRC2020 dataset show that the proposed method achieves satisfactory performance improvements on both seen and unseen accents, even when accent information is still preserved in the output of the encoder.
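To make the two ingredients named above concrete, the following is a minimal, hypothetical PyTorch sketch: domain adversarial training realized with a gradient reversal layer feeding an accent classifier, and an attention module that re-weights encoder frames into a more accent-agnostic representation. The module names, dimensions, and the multi-head self-attention choice are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class AccentClassifier(nn.Module):
    """Adversarial branch: predicts the accent from gradient-reversed features."""

    def __init__(self, feat_dim: int, num_accents: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_accents),
        )

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, time, feat_dim); pool over time before classifying.
        pooled = GradReverse.apply(enc_out, self.lambd).mean(dim=1)
        return self.proj(pooled)


class AdaptiveAttention(nn.Module):
    """Stand-in for the adaptive attention module: self-attention over encoder
    frames that maps the encoder output to a more general representation."""

    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        adapted, _ = self.attn(enc_out, enc_out, enc_out)
        return self.norm(enc_out + adapted)


if __name__ == "__main__":
    enc_out = torch.randn(8, 120, 256)                   # (batch, frames, feature dim)
    adapted = AdaptiveAttention(256)(enc_out)             # fed to the ASR decoder / CTC head
    accent_logits = AccentClassifier(256, 8)(adapted)     # trained with an adversarial accent loss
    print(adapted.shape, accent_logits.shape)
```

In such a setup the ASR loss on the adapted features and the accent classification loss on the gradient-reversed features would be optimized jointly, so the encoder is pushed toward representations that serve recognition while carrying less accent-discriminative information.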