Abstract

In recent years, speech recognition technology based on deep learning models has made great progress, and recognition accuracy has exceeded 90%. Speech evaluation is an important application in foreign language learning, where billions of learners need to practice pronunciation effectively. However, because speech recognition and speech evaluation have different goals, a single speech recognition model cannot be applied directly to pronunciation evaluation. This paper proposes a DDNN (double-layer deep neural network) model, which consists of a speech-text alignment model and a speech recognition model. In the first layer, the speech alignment model, a new Viterbi-based method is proposed to find the best path for aligning speech and text. In the second layer, which performs speech evaluation and scoring, we are the first to use a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network) in the encoding part of the Attention model. The accuracy of the CTC model reaches 76.7%, and that of the Attention model reaches 81.2%. The experimental results show that the speech-text alignment method is effective and that the Attention model gives better speech evaluation results. The FRR (false rejection rate), FAR (false acceptance rate), and DER (diagnostic error rate) of the Attention model were 4.5%, 5.1%, and 17.9%, respectively. In addition, the DDNN model evaluates each sentence within 1 second in the online experiment, so it can also be applied to online real-time pronunciation evaluation.
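The abstract reports FRR, FAR, and DER without defining them. The sketch below uses the definitions common in the mispronunciation-detection literature; the paper's exact formulas are not given in this summary, so these are an assumption. The class name, field names, and the counts in the usage note are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Phone-level outcome counts (hypothetical names, not from the paper)."""
    true_accept: int    # correct pronunciation, accepted as correct
    false_reject: int   # correct pronunciation, flagged as an error
    false_accept: int   # mispronunciation, accepted as correct
    correct_diag: int   # mispronunciation detected and diagnosed correctly
    diag_error: int     # mispronunciation detected but diagnosed wrongly


def frr(c: DetectionCounts) -> float:
    """False rejection rate: share of correct phones that were rejected."""
    return c.false_reject / (c.true_accept + c.false_reject)


def far(c: DetectionCounts) -> float:
    """False acceptance rate: share of mispronounced phones that were accepted."""
    true_reject = c.correct_diag + c.diag_error
    return c.false_accept / (c.false_accept + true_reject)


def der(c: DetectionCounts) -> float:
    """Diagnostic error rate: share of detected errors with a wrong diagnosis."""
    return c.diag_error / (c.correct_diag + c.diag_error)


# Illustrative counts only; they are not results from the paper.
counts = DetectionCounts(true_accept=900, false_reject=100,
                         false_accept=20, correct_diag=60, diag_error=20)
print(f"FRR={frr(counts):.3f} FAR={far(counts):.3f} DER={der(counts):.3f}")
```

Under these assumed definitions, FRR and FAR measure detection quality, while DER only considers the errors that were detected, which is why it can be much higher than the other two.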

Highlights

  • With the advent of globalization, the number of people learning foreign languages is increasing

  • More and more researchers are becoming involved in the study of CALL (Computer-Aided Language Learning), a research field within speech recognition

  • Following [18], the detection and diagnosis of phonetic errors is divided into three parts: research based on pronunciation scoring, speech recognition networks based on forced alignment, and studies of acoustic characterization and modeling


Summary

INTRODUCTION

With the advent of globalization, the number of people learning foreign languages is increasing. Following [18], the detection and diagnosis of phonetic errors is divided into three parts: research based on pronunciation scoring, speech recognition networks based on forced alignment, and studies of acoustic characterization and modeling. A two-layer deep learning neural network model based on CTC and Attention is proposed to detect Japanese pronunciation errors, and it achieves state-of-the-art results; 2. Word-level phoneme recognition combining CNN with LSTM and Attention is proposed to detect pronunciation errors, and it outperforms detection with CTC based on LSTM; 4. Given the pronunciation characteristics of the language, although the first model achieves phoneme-level alignment, we still output in word units, so that inconsistent phoneme effects do not reduce the accuracy of forced alignment. Another key advantage is that working with speech in word units does not lose information about the phoneme context. As its name suggests, CTC [44] is designed for temporal classification tasks, that is, for sequence labeling problems where the alignment between the inputs and the target labels is unknown.
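To make "alignment is unknown" concrete: CTC lets the network emit one symbol per frame, including a special blank, and then maps every frame-level path to a label sequence by merging repeated symbols and deleting blanks. The sketch below shows only that standard collapsing step, not the paper's model; the blank symbol "_" and the romanized example labels are assumptions chosen for illustration.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol; real systems usually reserve an integer id


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level CTC path to its label sequence.

    Step 1: merge runs of identical consecutive symbols.
    Step 2: remove blanks.
    A blank between two identical labels keeps them as separate labels.
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]


# Two different frame-level paths that collapse to the same labels.
path_a = list("__kk_o_nn_n_i")
path_b = ["k", "_", "o", "n", "_", "n", "i"]
print(ctc_collapse(path_a))
print(ctc_collapse(path_b))
```

Because many paths collapse to the same label sequence, CTC training sums the probability over all of them, so no frame-level alignment has to be supplied with the training data.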

