Abstract

With the rapid development of Internet technology and educational informatization, more and more spoken-language materials are available online. How to adapt to learners’ dynamic abilities and provide them with personalized learning materials has therefore become an important issue in educational technology. To address the inefficiency of existing automatic spoken-English assessment, a multimodal automatic assessment method for spoken English is proposed. The Word2Vec model is used to extract text features; speech and text features are then fed into a GRU temporal structure, and an encoder performs multimodal fusion to realize automatic multimodal assessment of spoken English. Simulation results show that the proposed multimodal model outperforms traditional automatic spoken-English assessment models in terms of fluency, emotional expression, and sense of rhythm, and can better improve learners’ oral English level.
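The pipeline in the abstract (per-modality GRU encoders followed by encoder-style fusion) can be sketched as below. This is a minimal illustration, not the paper's implementation: the weights are random stand-ins for trained parameters, the text embeddings stand in for Word2Vec vectors, and the audio features stand in for acoustic frames (e.g. MFCCs); all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_gru(input_dim, hidden_dim):
    """Random GRU parameters (stand-in for trained weights)."""
    def mat(r, c):
        return rng.normal(scale=0.1, size=(r, c))
    # One (input weight, recurrent weight, bias) triple per gate.
    return {g: (mat(input_dim, hidden_dim), mat(hidden_dim, hidden_dim),
                np.zeros(hidden_dim)) for g in ("z", "r", "n")}

def gru_encode(seq, params, hidden_dim):
    """Run a GRU over seq (T x input_dim) and return the final hidden state."""
    h = np.zeros(hidden_dim)
    for x in seq:
        z = sigmoid(x @ params["z"][0] + h @ params["z"][1] + params["z"][2])
        r = sigmoid(x @ params["r"][0] + h @ params["r"][1] + params["r"][2])
        n = np.tanh(x @ params["n"][0] + (r * h) @ params["n"][1] + params["n"][2])
        h = (1 - z) * h + z * n
    return h

# Toy inputs: text token embeddings (as Word2Vec would supply) and audio frames.
text_emb = rng.normal(size=(6, 16))     # 6 tokens, 16-dim embeddings (assumed)
audio_feat = rng.normal(size=(40, 13))  # 40 frames, 13 MFCC-like features (assumed)

H = 32
h_text = gru_encode(text_emb, make_gru(16, H), H)
h_audio = gru_encode(audio_feat, make_gru(13, H), H)

# Encoder-style fusion: concatenate modality encodings, project to one score.
fused = np.concatenate([h_text, h_audio])
score = sigmoid(fused @ rng.normal(scale=0.1, size=2 * H))
```

In practice each GRU and the fusion projection would be trained jointly on scored speech samples; the sketch only shows how the two modality encodings meet in a single fused representation.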

Highlights

  • As international English communication grows in importance, how to enhance learners’ oral English level has become a focus of current research

  • Many scholars propose combining modern information technology to improve learners’ oral proficiency: Constantinides G A et al. [1] build an automatic speech assessment model from speech signals, Kovacs G et al. [2] use deep-learning methods to optimize the automatic speech evaluation model, and Zhang and Qin [3] establish an automatic evaluation model using DTW template matching

  • In [5], an automatic evaluation method for speech quality based on speech-signal detection and dynamic synchronous recognition can be applied to the automatic assessment of spoken English
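The DTW template matching mentioned in [3] aligns a learner's feature sequence against a reference template despite differences in speaking rate. A minimal sketch of the classic dynamic-time-warping distance over 1-D feature sequences (real systems compare multidimensional acoustic features, e.g. MFCC frames):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D sequences.

    D[i, j] holds the minimal accumulated cost of aligning a[:i] with b[:j];
    each cell extends the cheapest of the three allowed predecessor moves
    (match, insertion, deletion).
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

Because the warping path may repeat elements, a sequence spoken more slowly (e.g. `[1, 2, 2, 3]` versus the template `[1, 2, 3]`) still aligns with zero cost.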


Summary

Introduction

As international English communication grows in importance, how to enhance learners’ oral English level has become a focus of current research. Yi et al. [4] use a multimodal neural network to predict professional evaluations of press conferences from collected text and audio data. The network is composed of three parts: a language model, an audio model, and a feature fusion network. Two kinds of feature fusion are used in the multimodal neural network: a shared attention network, and separate text-feature and audio-feature generation. Compared with the former, the latter performs much better, with an accuracy rate of around 60%. To solve the above problems, this paper proposes a multimodal approach for the automatic assessment of oral English, which combines pronunciation and text and constructs an automatic assessment model through joint learning, so as to provide a more effective auxiliary tool for oral English scoring. By organizing the output layer as a tree structure, the multiclass problem is transformed into log(V) binary classification problems, which greatly improves the training efficiency of the model.
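The tree-structured output layer mentioned above is the hierarchical-softmax idea used in Word2Vec: instead of one V-way softmax, a word's probability is the product of log(V) binary decisions along its path from the root. A minimal sketch for a complete binary tree over V = 2**depth words, with random node vectors standing in for trained parameters (the heap-style node indexing is an illustrative choice, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_word_prob(word_idx, hidden, node_vecs, depth):
    """P(word | hidden) as a product of binary decisions along the tree path.

    For a complete binary tree over V = 2**depth leaves, the branch taken at
    each level is given by the bits of word_idx; internal nodes are stored in
    a heap-indexed array node_vecs of shape (V - 1, d).
    """
    p = 1.0
    node = 0  # root
    for level in range(depth):
        bit = (word_idx >> (depth - 1 - level)) & 1  # 0 = left, 1 = right
        s = sigmoid(hidden @ node_vecs[node])
        p *= s if bit == 0 else (1.0 - s)            # one binary classifier
        node = 2 * node + 1 + bit                    # descend to the child
    return p
```

Each training step touches only the depth = log2(V) nodes on one path rather than all V outputs, which is where the efficiency gain comes from; by construction the leaf probabilities still sum to 1.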

Automatic Oral English Assessment Model Based on Multimodality
Experiment and Analysis
Findings
Conclusion

