Abstract

In recent years, as international exchange has become more frequent, language has come to be widely recognized as a tool of communication, and language learning has placed growing emphasis on oral instruction. In traditional classrooms, however, oral teaching is constrained by the teacher-student ratio: a single teacher faces dozens of students, so one-on-one speaking practice and pronunciation guidance is impractical, and instruction is further limited by teacher availability and the learning environment. Research on efficient, automated pronunciation training has therefore attracted increasing attention. Many English phonemes, especially vowels, have distinct visual facial features; almost all vowels can be distinguished by the roundness and tightness of the lips. To exploit lip features for pronunciation error detection, this paper proposes a multimodal feature fusion model based on lip-angle features. The model interpolates lip features constructed from lip opening and closing angles and combines the audio and video streams along the time axis. Feature alignment, fusion, learning, and classification are performed by a bidirectional LSTM with a softmax layer, and end-to-end pronunciation error detection is achieved through CTC. The model is evaluated on the GRID audio-visual corpus after phoneme conversion and on a self-built multimodal test set. Experimental results show that the model achieves a higher mispronunciation detection rate than a traditional single-modal acoustic error detection model, with a clear improvement in the error detection rate. Evaluation on the audio-visual corpus with added white noise further shows that the proposed model is more robust to noise than the traditional acoustic model.
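
To make the described pipeline concrete, the following is a minimal sketch of the fusion architecture outlined above: lip-angle features interpolated to the audio frame rate, concatenated with acoustic features, processed by a bidirectional LSTM with a softmax output, and trained with CTC. It assumes a PyTorch implementation; all class names, feature dimensions, and layer sizes (`audio_dim`, `lip_dim`, `hidden`, `num_phonemes`) are illustrative assumptions rather than the paper's reported configuration.

```python
import torch
import torch.nn as nn

class AVFusionErrorDetector(nn.Module):
    """Sketch of the multimodal fusion model: lip-angle features are
    upsampled to the audio frame rate, fused with acoustic features,
    and fed to a BiLSTM + softmax trained with CTC."""

    def __init__(self, audio_dim=39, lip_dim=4, hidden=128, num_phonemes=40):
        super().__init__()
        # +1 output class for the CTC blank symbol
        self.blstm = nn.LSTM(audio_dim + lip_dim, hidden,
                             num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_phonemes + 1)

    def forward(self, audio_feats, lip_feats):
        # audio_feats: (batch, T_audio, audio_dim), e.g. MFCC frames
        # lip_feats:   (batch, T_video, lip_dim), lip opening/closing angles
        # Interpolate the video-rate lip features up to the audio frame rate
        # so the two streams can be concatenated frame by frame.
        lip_upsampled = nn.functional.interpolate(
            lip_feats.transpose(1, 2),             # (batch, lip_dim, T_video)
            size=audio_feats.size(1), mode="linear", align_corners=False
        ).transpose(1, 2)                           # (batch, T_audio, lip_dim)
        fused = torch.cat([audio_feats, lip_upsampled], dim=-1)
        out, _ = self.blstm(fused)
        # Log-softmax over phoneme classes, as required by CTC loss.
        return self.fc(out).log_softmax(dim=-1)

# Training-step sketch with CTC loss (dummy shapes for illustration only).
model = AVFusionErrorDetector()
ctc_loss = nn.CTCLoss(blank=40, zero_infinity=True)
audio = torch.randn(2, 200, 39)                   # 2 utterances, 200 audio frames
lips = torch.randn(2, 50, 4)                      # 50 video frames of lip angles
targets = torch.randint(0, 40, (2, 12))           # reference phoneme labels
log_probs = model(audio, lips).transpose(0, 1)    # (T, batch, classes) for CTCLoss
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((2,), 200, dtype=torch.long),
                target_lengths=torch.full((2,), 12, dtype=torch.long))
loss.backward()
```

At inference time, the CTC output sequence would be compared against the reference phoneme sequence to flag mispronounced phonemes; that comparison step is not shown here.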
