With the development of multimedia technologies, surveillance videos and other multimedia data have attracted widespread attention in many fields. Surveillance video makes it possible to monitor students' learning status in real time. However, current action recognition methods for classroom settings have two limitations. First, privacy and ethical concerns surrounding AI in education make public datasets of student behavior scarce. Therefore, based on a summary of seven typical classroom behaviors, course videos were collected from a smart classroom to build a student behavior dataset. Compared with existing student behavior recognition datasets, the proposed dataset is distinguished by cluttered backgrounds, crowded scenes, and occlusions. Second, existing relational reasoning methods struggle to distinguish students' body parts from small objects against cluttered backgrounds; they also make limited use of interactions among different relational features and fail to exploit their complementarity, resulting in poor interaction recognition performance. Therefore, an attention-based relational reasoning module is proposed to strengthen the interactive representation between small objects and human body parts. In addition, because different relational features are partly complementary, this study constructs a relational feature fusion module that models human-to-human interactions supported by body-part and surrounding-context features. Finally, the reconstructed relational features and human-appearance features are fused to achieve accurate interaction recognition. Experimental comparisons between the proposed model and current mainstream algorithms on the generated student behavior dataset verify that it achieves state-of-the-art action recognition performance.
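To make the described pipeline concrete, the sketch below illustrates one plausible reading of the architecture: attention from body-part features to small-object features, gated fusion of complementary relational streams, and concatenation with appearance features for classification. All module names, tensor shapes, and the gating design are illustrative assumptions expressed in PyTorch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AttentionRelationalReasoning(nn.Module):
    """Hypothetical sketch: scaled dot-product attention from human body-part
    features (queries) to small-object features (keys/values), yielding an
    interaction-aware relational feature per person."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, parts, objects):
        # parts:   (persons, body_parts, dim); objects: (objects, dim)
        q = self.q(parts)
        k = self.k(objects)
        v = self.v(objects)
        attn = torch.einsum('pbd,od->pbo', q, k) * self.scale
        attn = attn.softmax(dim=-1)                   # weights over objects
        rel = torch.einsum('pbo,od->pbd', attn, v)    # attended object info
        return rel.mean(dim=1)                        # pool over body parts

class RelationalFeatureFusion(nn.Module):
    """Hypothetical sketch: fuse complementary relational streams
    (e.g., human-object, human-human, human-context) with a learned gate."""
    def __init__(self, dim, n_streams=3):
        super().__init__()
        self.gate = nn.Linear(n_streams * dim, n_streams)
        self.proj = nn.Linear(dim, dim)

    def forward(self, streams):
        # streams: list of (persons, dim) relational features
        stacked = torch.stack(streams, dim=1)               # (P, S, d)
        w = self.gate(stacked.flatten(1)).softmax(dim=-1)   # (P, S)
        fused = (stacked * w.unsqueeze(-1)).sum(dim=1)      # weighted sum
        return self.proj(fused)

# Usage: combine relational and appearance features for 7-way classification.
dim, n_classes = 256, 7
parts = torch.randn(4, 10, dim)    # 4 students, 10 body parts each (assumed)
objects = torch.randn(6, dim)      # 6 detected small objects (assumed)
appearance = torch.randn(4, dim)   # per-student appearance features

rel_ho = AttentionRelationalReasoning(dim)(parts, objects)
streams = [rel_ho, torch.randn(4, dim), torch.randn(4, dim)]  # other streams stubbed
fused = RelationalFeatureFusion(dim)(streams)
logits = nn.Linear(2 * dim, n_classes)(torch.cat([fused, appearance], dim=-1))
print(logits.shape)  # torch.Size([4, 7])
```

The learned gate is just one simple way to exploit complementarity between relational streams; the paper's fusion module may weight or combine them differently.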