End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture

Long Zhang, Chang Gao, Linlin Shan, Ziping Zhao, Huazhi Sun, Shiwen Deng, Chunmei Ma, Lifen Jiang

doi:10.3390/s20071809

Abstract

Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning, end-to-end ASR has gradually matured and achieved positive practical results, which provides a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. The improved ASR system was then applied to the Mandarin APED task, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that, with regard to accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–deep neural network (DNN–DNN) architecture, and that it performs better on the F-measure metrics, which are especially suited to the requirements of the APED task.

Highlights

With the continuous development of economic globalization and social integration, more and more people are eager to learn a second language
In the traditional end-to-end automatic speech recognition (ASR) system based on the hybrid connectionist temporal classification (CTC)/attention architecture, an empirical parameter must be set manually before training and remains unchanged throughout the training process. To solve this problem, we introduce an adaptive parameter based on the Sigmoid function that does not need to be set in advance and is adjusted continuously during training
We identify the type (i.e., true acceptance (TA), false acceptance (FA), false rejection (FR), and true rejection (TR)) of each phone in the automatic pronunciation error detection (APED) task; for example, FA means that a phone segment marked F by the experts is recognized correctly by the ASR
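The four outcome types named in the highlights can be tallied per phone and turned into an F-measure, as in the sketch below. The text only defines FA explicitly. The other three cases are filled in here by symmetry, "recognized correctly" is read as "the ASR output matches the canonical phone", and the precision and recall formulas follow a convention common in mispronunciation detection work. All of these, along with the function and argument names, are assumptions rather than the paper's stated definitions, and the sample data is invented.

```python
from collections import Counter
from typing import Iterable, Tuple


def phone_outcome(expert_mark: str, asr_matches_canonical: bool) -> str:
    """Classify one phone as TA, FA, FR, or TR.

    expert_mark: "T" or "F", the label assigned by the experts.
    asr_matches_canonical: True if the ASR recognized the phone correctly
    (ASSUMPTION: i.e. its output equals the canonical phone).
    """
    if expert_mark not in ("T", "F"):
        raise ValueError(f"unexpected expert mark: {expert_mark!r}")
    if asr_matches_canonical:
        return "TA" if expert_mark == "T" else "FA"
    return "FR" if expert_mark == "T" else "TR"


def detection_scores(pairs: Iterable[Tuple[str, bool]]):
    """Return (counts, precision, recall, f_measure) for error detection.

    ASSUMPTION: precision = TR / (TR + FR) and recall = TR / (TR + FA),
    treating a detected mispronunciation (TR) as the positive class.
    """
    counts = Counter(phone_outcome(mark, ok) for mark, ok in pairs)
    tr, fr, fa = counts["TR"], counts["FR"], counts["FA"]
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall = tr / (tr + fa) if tr + fa else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return counts, precision, recall, f_measure


if __name__ == "__main__":
    # Invented sample: (expert mark, did the ASR match the canonical phone?)
    sample = [("T", True), ("T", True), ("T", False),
              ("F", True), ("F", False), ("F", False)]
    counts, p, r, f = detection_scores(sample)
    print(dict(counts), p, r, f)
```

Under this reading, the F-measure rewards catching true mispronunciations (TR) while penalizing both missed errors (FA) and false alarms (FR), which is why it is a natural fit for the APED task.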

Summary

Introduction

With the continuous development of economic globalization and social integration, more and more people are eager to learn a second language. Computer-assisted language learning (CALL) systems that focus on speech and pronunciation are usually called computer-assisted pronunciation training (CAPT) systems. CAPT systems can efficiently process and analyze the speech uttered by language learners and give them a quantitative or qualitative assessment of their pronunciation quality or ability as feedback. This process is known as automatic pronunciation (quality/proficiency) assessment (evaluation/scoring).

Methods

Results

Discussion

Conclusion