In recent years, more and more people have begun to use massive open online course (MOOC) platforms for distance learning. However, due to the spatiotemporal separation between teachers and students, negative emotional states of students in MOOC learning cannot be identified in a timely manner, so students cannot receive immediate feedback about their emotional states. In order to identify and classify learners’ emotions in video learning scenarios, we propose a multimodal emotion recognition method based on eye movement signals, audio signals, and video images. In this method, two novel features are proposed: the feature of coordinate difference of eye movement (FCDE) and the pixel change rate sequence (PCRS). FCDE is extracted by combining the eye movement coordinate trajectory with the video optical flow trajectory, and can represent the learner’s degree of attention. PCRS is extracted from the video image and can represent the speed of image switching. A convolutional neural network (CNN)-based feature extraction network (FE-CNN) is designed to extract the deep features of the three modalities. The extracted deep features are fed into the emotion classification CNN (EC-CNN) to classify the emotions, including interest, happiness, confusion, and boredom. In single-modality recognition, the accuracies of the three modalities are 64.32%, 74.67%, and 71.88%, respectively. The three modalities are fused by feature-level, decision-level, and model-level fusion methods, and the evaluation experiments show that decision-level fusion achieves the highest emotion recognition accuracy, at 81.90%. Finally, the effectiveness of the FCDE, FE-CNN, and EC-CNN modules is verified by ablation experiments.
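The abstract does not give the exact formulation of PCRS; as a minimal illustrative sketch, the pixel change rate of a frame could be taken as the mean absolute intensity difference from the previous grayscale frame, assuming OpenCV (`cv2`) is used for frame decoding. The function and parameter names below are hypothetical and not from the paper.

```python
import numpy as np
import cv2  # assumed dependency for reading video frames


def pixel_change_rate_sequence(video_path, resize=(64, 64)):
    """Sketch of a pixel change rate sequence (PCRS) over a lecture video.

    The rate for frame t is the mean absolute intensity difference between
    frame t and frame t-1, normalized to [0, 1]; larger values indicate
    faster image switching.
    """
    cap = cv2.VideoCapture(video_path)
    rates, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, resize),
                            cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            rates.append(float(np.mean(np.abs(gray - prev)) / 255.0))
        prev = gray
    cap.release()
    return np.asarray(rates, dtype=np.float32)
```

Such a sequence could then serve as one input stream to a feature extraction network alongside the eye movement and audio features; the actual network architecture and feature definitions are described in the full paper.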