Abstract

Equipped with a variety of IoT devices, the smart classroom can record diverse forms of teaching data and thus provides rich material for recognizing teachers' emotions. Recognizing and analyzing teachers' emotions can promote their professional development. At present, most automatic emotion recognition methods for teachers in the smart classroom are based on facial expressions. However, since teachers usually keep smiling to maintain the classroom atmosphere, facial-expression-based recognition results may not reflect their real mental state. By observing teaching videos, we found that the prosody and text of teachers' speech can reveal their implicit emotions. Therefore, a multimodal teacher emotion dataset (MTED) was built from teaching videos recorded by IoT cameras and microphones in the smart classroom, and a neural network, ProsodyBERT, which combines multiple prosodic features with text content, is proposed for teacher speech emotion recognition, filling a gap in this task. Experimental results show that ProsodyBERT achieves 78.6% UA4 on IEMOCAP and 66.2% UA6 on MELD, surpassing existing methods. On the self-built MTED dataset, the proposed method reaches 82.1% UA6, which is 9.6%–21.4% higher than unimodal methods for teacher emotion recognition. An ablation study on the MTED dataset examines the influence of each module of ProsodyBERT on the teacher speech emotion recognition task. The experimental results on smart classroom recordings show that ProsodyBERT has higher accuracy and stronger robustness than unimodal methods.
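
The abstract describes fusing prosodic speech features with text content for emotion classification. Below is a minimal PyTorch sketch of one plausible late-fusion design in that spirit: a BERT text encoder concatenated with an MLP over prosodic statistics. The layer sizes, feature dimension, fusion scheme, and class count are assumptions for illustration, not the authors' exact ProsodyBERT architecture.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class ProsodyTextFusion(nn.Module):
    """Hypothetical late-fusion model: BERT text embedding + prosodic features."""
    def __init__(self, n_prosodic_feats=64, n_classes=6):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Encode hand-crafted prosodic statistics (e.g. pitch, energy,
        # speaking rate); the 64-dim input and 128-dim hidden size are assumed.
        self.prosody_mlp = nn.Sequential(
            nn.Linear(n_prosodic_feats, 128), nn.ReLU(), nn.Dropout(0.1)
        )
        self.classifier = nn.Linear(self.bert.config.hidden_size + 128, n_classes)

    def forward(self, input_ids, attention_mask, prosodic_feats):
        text_vec = self.bert(input_ids=input_ids,
                             attention_mask=attention_mask).pooler_output
        pros_vec = self.prosody_mlp(prosodic_feats)
        # Simple concatenation fusion; the paper may use a different scheme.
        fused = torch.cat([text_vec, pros_vec], dim=-1)
        return self.classifier(fused)  # logits over the emotion classes

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = ProsodyTextFusion()
enc = tokenizer(["Let's look at this example together."], return_tensors="pt")
prosody = torch.randn(1, 64)  # placeholder for real prosodic statistics
logits = model(enc["input_ids"], enc["attention_mask"], prosody)

Concatenation is the simplest way to combine the two modalities; attention-based fusion is a common alternative when one modality should weight the other.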
