Abstract

In this paper, we present the XMUSPEECH systems for Track 2 of the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC2020). Track 2 is an automatic speech recognition (ASR) task in which the non-native English speakers have a variety of accents, which reduces the accuracy of the ASR system. To address this problem, we experimented with different acoustic models and input features. Furthermore, we trained a TDNN-LSTM language model for lattice rescoring to obtain better results. Compared with our baseline system, we achieved relative word error rate (WER) improvements of 40.7% and 35.7% on the development set and evaluation set, respectively.
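To make the reported metric concrete, the following is a minimal sketch of how WER and a relative WER improvement are conventionally computed. The function names `word_error_rate` and `relative_improvement` are hypothetical, and the numbers in the usage example are illustrative only; they are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction of a system with respect to a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(relative_improvement(baseline_wer=10.0, system_wer=6.0))
```

Under this definition, a relative improvement of 40.7% means the system's WER is 59.3% of the baseline WER, not that the WER fell by 40.7 absolute points.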

Highlights

  • Standard English automatic speech recognition (ASR) systems already achieve high recognition accuracy and meet the commercial requirements of certain scenarios.

  • Model M1 was trained without any auxiliary embeddings, and its word error rate (WER) was the highest in Table 3. We therefore conclude that using embeddings as complementary features can significantly improve the performance of an accented English speech recognition system.

  • We explored various approaches to improve the accuracy of the accented ASR system.

Summary

Introduction

Standard English ASR systems already achieve high recognition accuracy and meet the commercial requirements of certain scenarios. There is much interest in developing the acoustic model [9,10,11]. Ahmed et al. [11] proposed a convolutional neural network (CNN)-based architecture with variable filter sizes along the frequency band of the audio utterances, and its overall accuracy for accented speech recognition surpassed all of the prior work. Shi et al. [9] used a TDNN [12] as the acoustic model for accented English speech recognition and achieved a relatively low average WER.

We followed the conventional steps to train hybrid GMM-HMM acoustic models, referring to the Kaldi [19] recipe for CHIME6 (https://github.com/kaldi-asr/kaldi/tree/master/egs/chime6/s5_track (accessed on 24 August 2021)).

Multistream CNN: We positioned a 5-layer CNN to better accommodate the top SpecAugment layer, followed by an 11-layer multistream CNN [14].
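As a rough illustration of why a multistream design can help, the sketch below computes the temporal context seen by a stack of dilated 1-D convolution layers. It assumes that each stream applies its own dilation rate, and that every layer uses a kernel size of 3; the dilation rates, the kernel size, and the function names are assumptions for illustration and are not taken from this summary. Only the depth of 11 layers comes from the text above.

```python
def receptive_field(num_layers: int, kernel_size: int, dilation: int) -> int:
    """Frames of input context seen by one output frame.

    Each stride-1 dilated convolution layer widens the context by
    (kernel_size - 1) * dilation frames.
    """
    if num_layers < 0 or kernel_size < 1 or dilation < 1:
        raise ValueError("invalid layer configuration")
    return 1 + num_layers * (kernel_size - 1) * dilation


def multistream_receptive_fields(
    num_layers: int, kernel_size: int, dilations: list[int]
) -> dict[int, int]:
    """Map each stream's (assumed) dilation rate to its receptive field."""
    return {d: receptive_field(num_layers, kernel_size, d) for d in dilations}


if __name__ == "__main__":
    # Hypothetical configuration: 11 layers per stream, kernel size 3,
    # and three streams with dilation rates 1, 2 and 3.
    for dilation, frames in multistream_receptive_fields(11, 3, [1, 2, 3]).items():
        print(f"dilation {dilation}: {frames} frames of context")
```

With these assumed settings the streams cover 23, 45, and 67 frames respectively, so combining their outputs gives the model access to several temporal resolutions at once.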

Multistream CNN Architecture
Effect of Acoustic Model
Effect of Language Model Rescoring
Conclusions
