Cross-modal retrieval (CMR) from multimodal educational slides is a significant challenge in digital education, largely because academic content is complex: slides mix images, diagrams, equations, and tables across subjects such as mathematics and biology. Existing CMR systems are designed primarily for natural-image-to-text retrieval (or vice versa) and handle real-world educational scenarios poorly. This study presents EduCross, a framework designed to improve CMR over multimodal educational slides, a domain where traditional retrieval systems fall short. Tailored to the educational context, EduCross integrates dual adversarial bipartite hypergraph learning, combining generative adversarial networks with figure-text dual channels. This combination supports robust bidirectional mapping, precisely associating slide figures with the spoken-language segments that describe them. Specifically, we develop framelet-based deep bipartite hypergraph neural networks that capture the high-order relationships between diverse educational content types and the various kinds of slide figures. Experiments on the real-world Multimodal Lecture Presentations dataset, which mirrors authentic educational settings, show that EduCross outperforms existing methods, marking a clear advance in the accurate retrieval of multimodal educational content.
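To make the bipartite-hypergraph idea concrete, the sketch below shows one round of degree-normalized message passing between figure nodes and spoken-text nodes through shared hyperedges, where each hyperedge groups a figure with the utterance segments that describe it. This is an illustrative assumption, not the authors' implementation: the incidence matrices `H` and `G`, the layer name, the feature dimensions, and the simple mean-style propagation are all hypothetical, and the framelet transforms and adversarial dual channels of EduCross are omitted.

```python
# Minimal sketch of bipartite hypergraph message passing (PyTorch).
# All names and the propagation rule are assumptions for illustration,
# not the EduCross architecture itself.
import torch
import torch.nn as nn

class BipartiteHypergraphLayer(nn.Module):
    """One figure<->text propagation step through hyperedges.

    H: |figures| x |hyperedges| incidence matrix (hypothetical).
    G: |texts|   x |hyperedges| incidence matrix (hypothetical).
    Each hyperedge links one slide figure to the spoken-language
    segments that describe it, modeling a high-order relation.
    """
    def __init__(self, dim):
        super().__init__()
        self.fig_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, X_fig, X_txt, H, G):
        # Gather node features into hyperedge features, normalized
        # by the number of nodes incident to each hyperedge.
        De = (H.sum(0) + G.sum(0)).clamp(min=1.0)        # hyperedge degrees
        E = (H.t() @ self.fig_proj(X_fig)
             + G.t() @ self.txt_proj(X_txt)) / De.unsqueeze(1)
        # Scatter hyperedge features back to each side of the bipartition,
        # normalized by node degree, yielding embeddings in a shared space.
        Df = H.sum(1).clamp(min=1.0).unsqueeze(1)        # figure degrees
        Dt = G.sum(1).clamp(min=1.0).unsqueeze(1)        # text degrees
        return self.act(H @ E / Df), self.act(G @ E / Dt)

# Toy usage: 4 figures, 6 text segments, 3 hyperedges, 32-d features.
fig, txt = torch.randn(4, 32), torch.randn(6, 32)
H = torch.zeros(4, 3)
H[0, 0], H[1, 1], H[2, 2], H[3, 2] = 1.0, 1.0, 1.0, 1.0
G = torch.zeros(6, 3)
G[:2, 0], G[2:4, 1], G[4:, 2] = 1.0, 1.0, 1.0
layer = BipartiteHypergraphLayer(32)
fig_out, txt_out = layer(fig, txt, H, G)
```

In a full CMR pipeline, the resulting figure and text embeddings would feed a retrieval objective (and, per the abstract, adversarial figure-text channels); the bipartite incidence structure is what lets one hyperedge bind a figure to several utterances at once, rather than to a single paired caption.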