Abstract

As economic globalization brings countries into ever closer contact, more and more people are paying attention to learning spoken English. Computer-aided language learning has made spoken English more convenient to learn, but its core task, the detection and correction of incorrect English pronunciation, remains inadequate. In this paper, we propose a multimodal end-to-end English pronunciation error detection and correction model based on audio and video. The model does not require phoneme forced alignment of the English pronunciation video signal to be processed, and it uses rich audio and video features for pronunciation error detection, which substantially improves detection accuracy, especially in noisy environments. Because current lip feature extraction algorithms are overly complicated and have limited representational power, we also propose a feature extraction scheme based on the lip opening and closing angle. The lip syllable frames are obtained by splitting the video into frames; the syllables are denoised; the key point information of the lips is obtained using a gradient enhancement-based regression tree algorithm; the effects of speaker tilt and movement are removed by scale normalization; finally, the lip opening and closing angles are computed geometrically, and the lip feature values are generated from the changes in angle.
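The last three steps of the lip feature scheme (normalization, angle computation, and angle change) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the lip key points have already been detected, and the landmark names (`left`, `right`, `top`, `bottom`), the choice of the left mouth corner as the vertex of the angle, and the use of frame-to-frame differences as the "angle change" feature are all assumptions made for the example.

```python
import math


def normalize(frame):
    """Remove translation, tilt, and scale from one frame of lip key points.

    `frame` is a hypothetical dict with keys "left", "right", "top", "bottom",
    each an (x, y) tuple. After normalization the left mouth corner sits at
    (0, 0) and the right mouth corner at (1, 0).
    """
    lx, ly = frame["left"]
    rx, ry = frame["right"]
    width = math.hypot(rx - lx, ry - ly)
    if width == 0:
        raise ValueError("mouth corners coincide; cannot normalize")
    theta = math.atan2(ry - ly, rx - lx)          # tilt of the mouth line
    c, s = math.cos(-theta), math.sin(-theta)     # rotate by -theta
    out = {}
    for name, (x, y) in frame.items():
        tx, ty = x - lx, y - ly                   # translate corner to origin
        out[name] = ((tx * c - ty * s) / width, (tx * s + ty * c) / width)
    return out


def opening_angle(frame):
    """Angle in degrees at the left corner between the upper and lower lip."""
    vx, vy = frame["left"]
    ux, uy = frame["top"][0] - vx, frame["top"][1] - vy
    wx, wy = frame["bottom"][0] - vx, frame["bottom"][1] - vy
    nu, nw = math.hypot(ux, uy), math.hypot(wx, wy)
    if nu == 0 or nw == 0:
        return 0.0
    cos_a = (ux * wx + uy * wy) / (nu * nw)
    cos_a = max(-1.0, min(1.0, cos_a))            # guard against rounding
    return math.degrees(math.acos(cos_a))


def lip_angle_features(frames):
    """Per-frame opening angles plus their frame-to-frame changes."""
    angles = [opening_angle(normalize(f)) for f in frames]
    deltas = [b - a for a, b in zip(angles, angles[1:])]
    return angles, deltas
```

For a mouth whose corners are at (0, 0) and (2, 0) and whose lip midpoints are at (1, 1) and (1, -1), `opening_angle` gives 90 degrees. The angle is unchanged by translation, rotation, and uniform scaling, so in this sketch the normalization step mainly matters if raw coordinates are also used as features.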

Highlights

  • Economic globalization is drawing countries ever closer together, and the world has in effect become a “global village.” As the most widely spoken language in the world, English is crucial to economic and cultural exchange between countries and is therefore widely learned and used

  • Many mispronunciation detection and diagnosis (MDD) studies focus on the results of mispronunciation and ignore its causes, which makes it impossible to offer corrective advice based on pronunciation actions; MDD studies are conducted on the voice modality alone, ignoring the importance of lip features during pronunciation; and most pronunciation detection and error correction studies ignore the influence of noise on detection results

  • The end-to-end spoken English pronunciation error detection algorithm based on multimodal acoustic sensors proposed in this paper performs well in both phoneme sequence recognition and error detection, and it shows a degree of noise immunity


Summary

Introduction

Economic globalization is drawing countries ever closer together, and the world has in effect become a “global village.” As the most widely spoken language in the world, English is crucial to economic and cultural exchange between countries and is therefore widely learned and used. Computers have become part of everyday life and have driven a boom in online language teaching, but most online courses only provide videos of correct spoken English pronunciation; they do not check students’ pronunciation or point out their errors. The multimodal syllable fusion technique takes multiple pieces of information about the same object, acquired by multiple sensors, and processes them in different ways (null-frequency domain conversion, feature extraction, or decision-level determination) to obtain unified information that is richer, more accurate, and more reliable [3]. We propose a multimodal end-to-end English pronunciation error detection and correction model based on audio and video. We also propose to analyze the causes of errors in the perception process, in order to reveal how well different perception methods adapt to a dynamic environment and to build a domain knowledge base for a cognitive dynamic environment. Finally, we propose an autonomous perception model based on the listening mechanism and study autonomous English pronunciation recognition methods for intelligent machines.
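To make the distinction between fusing at the feature level and at the decision level concrete, here is a small sketch. It is not the model proposed in the paper: the function names, the dictionary-of-probabilities representation, and the fixed `audio_weight` parameter are all hypothetical choices made for illustration only.

```python
def feature_level_fusion(audio_features, video_features):
    """Feature-level fusion: join both feature vectors before classification."""
    return list(audio_features) + list(video_features)


def decision_level_fusion(audio_probs, video_probs, audio_weight=0.5):
    """Decision-level fusion: combine per-phoneme scores from two classifiers.

    `audio_probs` and `video_probs` are hypothetical dicts that map the same
    phoneme labels to probabilities. `audio_weight` is an assumed knob; it
    could be lowered when the audio channel is noisy.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if audio_probs.keys() != video_probs.keys():
        raise ValueError("both modalities must score the same labels")
    fused = {
        label: audio_weight * audio_probs[label]
        + (1.0 - audio_weight) * video_probs[label]
        for label in audio_probs
    }
    total = sum(fused.values())
    if total == 0:
        raise ValueError("all fused scores are zero")
    return {label: score / total for label, score in fused.items()}


def best_label(probs):
    """Return the label with the highest fused probability."""
    return max(probs, key=probs.get)
```

With equal weights, a phoneme that the video classifier strongly prefers can override a weaker audio preference; shifting `audio_weight` toward 1.0 lets the audio channel dominate instead.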

Related Work
English Pronunciation Standards Based on Multimodal Acoustic Sensors
Design of Multimodal Acoustic Sensors in English Pronunciation
Experimental Design and Analysis
Findings
Conclusion