CLIP2TF: Multimodal video–text retrieval for adolescent education

Xiaoning Sun, Tao Fan, Hongxu Li, Guozhong Wang, Peien Ge, Xiwu Shang

doi:10.1016/j.displa.2024.102801

Abstract

Rapid advances in artificial intelligence are creating new challenges and opportunities in adolescent education. Educational systems increasingly require automated detection and evaluation of teaching activities, which offers fresh perspectives for improving the quality of adolescent education. Although large-scale models are receiving significant attention in educational research, their high demand for computational resources and their limitations in specific applications constrain their widespread use in analyzing educational video content, especially when handling multimodal data. Current multimodal contrastive learning methods, which integrate video, audio, and text information, have achieved some success in video–text retrieval tasks. However, these methods typically employ simple weighted fusion strategies and fail to avoid noise and information redundancy. We therefore propose a novel network framework, CLIP2TF, which includes an efficient audio–visual fusion encoder. The encoder lets visual and audio features interact dynamically and integrates them, enhancing visual features that may be missing or insufficient in specific teaching scenarios while reducing redundant information transfer during modality fusion. Through ablation experiments on the MSRVTT and MSVD datasets, we first demonstrate the effectiveness of CLIP2TF in video–text retrieval tasks. Subsequent tests on teaching video datasets further demonstrate the applicability of the proposed method. This research not only showcases the potential of artificial intelligence in the automated assessment of teaching quality but also provides new directions for research in related fields.
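To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of a fixed weighted fusion next to an attention-based audio–visual fusion, followed by cosine-similarity retrieval. It is an illustration only and not the CLIP2TF encoder: the function names, the weight `alpha`, the feature dimensions, the random inputs, and the use of plain dot-product attention without learned projections are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_fusion(visual, audio, alpha=0.7):
    # Baseline (assumed form): fixed convex combination of mean-pooled features.
    return alpha * visual.mean(axis=0) + (1.0 - alpha) * audio.mean(axis=0)

def cross_attention_fusion(visual, audio):
    # Hypothetical stand-in for a dynamic fusion step: each visual frame
    # attends over the audio tokens, and the attended audio is added to the
    # frame as a residual. No learned projections are used here.
    d = visual.shape[-1]
    scores = visual @ audio.T / np.sqrt(d)    # (frames, audio_tokens)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    attended = weights @ audio                # (frames, d)
    return (visual + attended).mean(axis=0)   # pooled video embedding, (d,)

def retrieval_scores(text_emb, video_embs):
    # Cosine similarity between one text embedding and each video embedding.
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return v @ t

# Random placeholder features: 8 video frames and 5 audio tokens, 16-dim each.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 16))
audio = rng.normal(size=(5, 16))

fused_fixed = weighted_fusion(visual, audio)
fused_attn = cross_attention_fusion(visual, audio)
scores = retrieval_scores(rng.normal(size=16), np.stack([fused_fixed, fused_attn]))
```

In the fixed scheme every frame receives the same audio contribution, whereas in the attention variant the contribution depends on how strongly each frame matches each audio token. That dependence is one simple way to read "dynamic" fusion, under the assumptions stated above.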

Journal: Displays
