Abstract

Speech emotion analysis plays an important role in English teaching. Existing convolutional neural networks (CNNs) can fully explore the spatial features of speech information but cannot effectively exploit the temporal dependence of speech signals. In addition, it is difficult to build an efficient and robust sentiment analysis system from speech information alone. With the development of the Internet of Things (IoT), online multimodal information, including speech, video, and text, has become more readily available. To this end, this paper proposes a novel multimodal fusion emotion analysis system. First, by combining convolutional networks with Transformer encoders, the spatiotemporal dependencies of speech information are exploited effectively. To improve multimodal information fusion, we introduce an exchange-based fusion mechanism. Experimental results on a public dataset show that the proposed multimodal fusion model achieves the best performance. In online English teaching, teachers can effectively improve teaching quality by leveraging feedback on students' emotional states provided by our deep model.
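The abstract does not detail the exchange-based fusion mechanism. The sketch below illustrates one common reading of the idea (channel exchanging between modalities): feature channels whose importance score falls below a threshold are replaced by the corresponding channels of the other modality. The function name, the score inputs, and the thresholding criterion are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def exchange_fusion(feat_a, feat_b, score_a, score_b, thresh=0.1):
    """Sketch of exchange-based fusion between two modalities.

    feat_a, feat_b : (channels, time) feature maps of the two modalities
    score_a, score_b : (channels,) per-channel importance scores
    Channels scoring below `thresh` are swapped with the other modality.
    (Illustrative only; the paper's actual criterion may differ.)
    """
    fused_a = np.where(score_a[:, None] < thresh, feat_b, feat_a)
    fused_b = np.where(score_b[:, None] < thresh, feat_a, feat_b)
    return fused_a, fused_b
```

Under this reading, uninformative channels in one stream are filled with information from the other stream, letting the modalities complement each other without adding extra fusion parameters.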
