Speech emotion analysis plays an important role in English teaching by assessing the reading state of students. Teachers can dynamically adjust the teaching content according to students' emotional feedback and thereby improve teaching quality. However, the accuracy of speech emotion recognition is limited by unstable student emotions and background noise. Although multimodal data can alleviate the deficiencies of a single modality, collecting and annotating multimodal samples requires substantial resources. To address this issue, this paper proposes a novel multimodal sentiment analysis framework based on a domain-adaptive learning mechanism to assist English teaching. We construct a novel multi-task variational autoencoder framework that jointly performs reconstruction and classification. To improve speech emotion recognition performance, we introduce domain-adaptive learning based on the Wasserstein distance between the variational latent representations of the video domain (source domain) and the speech domain (target domain). To validate the effectiveness of the proposed model, we conducted extensive comparative experiments on two public datasets and a self-built English oral dataset. The experimental results indicate that the domain-adaptive learning mechanism effectively improves recognition performance in the target domain. On the self-built English teaching dataset, the proposed model outperforms other deep models.
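To make the training objective concrete, the following is a minimal sketch (not the authors' code) of a loss of the kind the abstract describes: a multi-task VAE optimized for reconstruction and classification, with a 2-Wasserstein term aligning the source (video) and target (speech) variational posteriors. The closed-form squared 2-Wasserstein distance between diagonal Gaussians is used here; all function names, tensor shapes, and weighting coefficients (beta, gamma, lam) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_w2(mu_s, logvar_s, mu_t, logvar_t):
    """Closed-form squared 2-Wasserstein distance between two
    diagonal Gaussians N(mu_s, diag(sigma_s^2)) and N(mu_t, diag(sigma_t^2))."""
    sigma_s = torch.exp(0.5 * logvar_s)
    sigma_t = torch.exp(0.5 * logvar_t)
    return ((mu_s - mu_t) ** 2 + (sigma_s - sigma_t) ** 2).sum(dim=1).mean()

def multitask_da_loss(x_hat, x, logits, labels,
                      mu_s, logvar_s, mu_t, logvar_t,
                      beta=1.0, gamma=1.0, lam=0.1):
    # Reconstruction task of the VAE (target-domain input).
    recon = F.mse_loss(x_hat, x)
    # Standard VAE KL regularizer on the target-domain posterior.
    kl = -0.5 * torch.mean(1 + logvar_t - mu_t.pow(2) - logvar_t.exp())
    # Emotion classification task on the shared latent code.
    cls = F.cross_entropy(logits, labels)
    # Domain adaptation: align source and target latent distributions.
    da = gaussian_w2(mu_s, logvar_s, mu_t, logvar_t)
    return recon + beta * kl + gamma * cls + lam * da
```

Under this assumed formulation, minimizing the Wasserstein term pulls the speech-domain latent distribution toward the better-resourced video-domain distribution, which is one plausible reading of how the domain-adaptive mechanism transfers knowledge to the target domain.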