Abstract

In this paper, we present the XMUSPEECH systems for Track 2 of the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC2020). Track 2 is an Automatic Speech Recognition (ASR) task in which non-native English speakers speak with various accents, which degrades the accuracy of a standard ASR system. To address this problem, we experimented with acoustic models and input features. Furthermore, we trained a TDNN-LSTM language model for lattice rescoring to obtain better results. Compared with our baseline system, we achieved relative word error rate (WER) improvements of 40.7% and 35.7% on the development and evaluation sets, respectively.

Highlights

  • Standard English Automatic Speech Recognition (ASR) systems already achieve high recognition accuracy and meet the commercial requirements of certain scenarios

  • Model M1 was trained without any auxiliary embeddings, and its word error rate (WER) was the highest in Table 3; we therefore conclude that using embeddings as complementary features can significantly improve the performance of an accented English speech recognition system

  • We explored various approaches to improve the accuracy of the accented ASR system

Summary

Introduction

The standard English ASR system has been able to obtain high recognition accuracy and meet the commercial requirements of certain scenarios. There is much interest in developing the acoustic model [9,10,11]. Ahmed et al. [11] proposed a convolutional neural network (CNN)-based architecture with variable filter sizes along the frequency band of the audio utterances, whose overall accuracy for accented speech recognition surpassed all prior work. Shi et al. [9] used a TDNN [12] as the acoustic model for accented English speech recognition and achieved a relatively low average WER. We followed the conventional steps to train hybrid GMM-HMM acoustic models, referring to the Kaldi [19] recipe for CHIME6 (https://github.com/kaldi-asr/kaldi/tree/master/egs/chime6/s5_track (accessed on 24 August 2021)). Multistream CNN: we positioned a 5-layer CNN to better accommodate the top SpecAugment layer, followed by an 11-layer multistream CNN [14], as sketched below.
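For concreteness, the following minimal PyTorch sketch illustrates the kind of front-end this describes: a 5-layer 2D CNN over SpecAugment-style features feeding parallel convolution streams with stream-specific dilation rates, as in a multistream CNN [14]. The channel counts, kernel sizes, and dilation rates here are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of a multistream CNN front-end (hyperparameters assumed).
import torch
import torch.nn as nn

class MultistreamCNN(nn.Module):
    def __init__(self, in_dim=40, channels=64, dilations=(1, 2, 3)):
        super().__init__()
        # 5-layer 2D CNN over (time, frequency), placed after SpecAugment
        convs, ch = [], 1
        for _ in range(5):
            convs += [nn.Conv2d(ch, channels, kernel_size=3, padding=1),
                      nn.ReLU()]
            ch = channels
        self.frontend = nn.Sequential(*convs)
        # Parallel 1D convolution streams with stream-specific dilation rates,
        # standing in for the multistream CNN body [14]
        feat = channels * in_dim
        self.streams = nn.ModuleList([
            nn.Conv1d(feat, channels, kernel_size=3, dilation=d, padding=d)
            for d in dilations
        ])

    def forward(self, x):  # x: (batch, time, freq)
        h = self.frontend(x.unsqueeze(1))               # (B, C, T, F)
        b, c, t, f = h.shape
        h = h.permute(0, 1, 3, 2).reshape(b, c * f, t)  # (B, C*F, T)
        outs = [s(h) for s in self.streams]             # one output per stream
        return torch.cat(outs, dim=1)                   # concat streams over channels

x = torch.randn(2, 100, 40)       # two utterances, 100 frames, 40-dim features
print(MultistreamCNN()(x).shape)  # torch.Size([2, 192, 100])

In the actual architecture, each stream would be an 11-layer stack (e.g., of factorized TDNN layers) rather than a single convolution; the point of the sketch is the parallel-streams-with-different-dilations structure and the shared CNN front-end.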

Multistream CNN Architecture
Effect of Acoustic Model
Effect of Language Model Rescoring
Conclusions