The variety and complexity of accents pose a major challenge to robust Automatic Speech Recognition (ASR). Previous work has attempted to address this problem, but most current approaches either require prior knowledge about the target accent or cannot handle unseen accents and accent-unspecific standard speech. In this work, we aim to improve multi-accent speech recognition in the end-to-end (E2E) framework with a novel layer-wise adaptation architecture. First, we propose a robust deep accent representation learning architecture to obtain accurate accent embeddings, along with several advanced schemes that further boost embedding quality: phone posteriorgram (PPG) features and TTS-based data augmentation in the training stage, and test-time augmentation and multi-embedding fusion in the testing stage. Then, layer-wise adaptation with accent embeddings is developed for fast accent adaptation in ASR, with two types of adapter layers: the gated adapter layer and the multi-basis adapter layer. In contrast to the usual two-pass adaptation, these adapter layers are injected between the ASR encoder layers to encode the accent information flexibly and to adapt quickly to the corresponding speech accent. Experiments on the AESRC accent corpus show that the proposed deep accent representation learning captures accurate accent knowledge and achieves high accent classification performance. The new layer-wise adaptation architecture with accurate accent embeddings outperforms traditional methods, obtaining a consistent <inline-formula><tex-math notation="LaTeX">$\sim$</tex-math></inline-formula>15% relative word error rate (WER) reduction across all testing scenarios, including seen accents, unseen accents, and accent-unspecific standard speech.
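The gated adapter layer described above can be illustrated with a minimal sketch: a hidden state from one encoder layer is concatenated with an accent embedding, a candidate correction is computed, and a sigmoid gate decides how much of that accent-conditioned correction is added back residually before the next encoder layer. All weight names, dimensions, and initializations below are illustrative assumptions, not the paper's actual implementation; a real system would use learned matrices inside a neural ASR encoder.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(x, W, b):
    # y = W x + b, with W as a list of rows
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def gated_adapter(h, e, Wa, ba, Wg, bg):
    """Hypothetical gated adapter injected between ASR encoder layers.

    h : hidden state from the previous encoder layer
    e : utterance-level accent embedding
    Returns a residual update of h, scaled per dimension by a gate.
    """
    x = h + e                                          # concatenate [h; e]
    a = [math.tanh(v) for v in linear(x, Wa, ba)]      # accent-conditioned correction
    g = [sigmoid(v) for v in linear(x, Wg, bg)]        # per-dimension gate in (0, 1)
    return [hi + gi * ai for hi, gi, ai in zip(h, g, a)]

# Toy dimensions: 4-d hidden state, 2-d accent embedding (illustrative only)
random.seed(0)
d_h, d_e = 4, 2
Wa = [[random.uniform(-0.5, 0.5) for _ in range(d_h + d_e)] for _ in range(d_h)]
Wg = [[random.uniform(-0.5, 0.5) for _ in range(d_h + d_e)] for _ in range(d_h)]
ba = [0.0] * d_h
bg = [-2.0] * d_h    # negative gate bias: adapter starts close to identity

h = [0.1, -0.3, 0.5, 0.2]
e = [1.0, -1.0]
out = gated_adapter(h, e, Wa, ba, Wg, bg)
print(out)           # same dimensionality as h
```

The negative gate bias is a common design choice for residual adapters: the layer initially passes the encoder state through almost unchanged, so injecting it does not disturb the pretrained ASR model, and the gate learns to open only where accent-specific correction helps.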