Toward Long Form Audio-Visual Video Understanding

Wenxuan Hou, Guangyao Li, Yapeng Tian, Di Hu

doi:10.1145/3672079

Abstract

We live in a world filled with never-ending streams of multimodal information. As a more natural recording of real scenarios, long form audio-visual videos (LFAVs) are expected to serve as an important bridge for better exploring and understanding the world. In this article, we propose the multisensory temporal event localization task in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale LFAV dataset with 5,175 videos and an average video length of 210 seconds. Each collected video is elaborately annotated with diverse modality-aware events in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events as well as understanding their relations in long form videos. It includes three phases at different levels: a snippet prediction phase to learn snippet features, an event extraction phase to extract event-level features, and an event interaction phase to study event relations. Experiments demonstrate that the proposed method, utilizing the new LFAV dataset, exhibits considerable effectiveness in localizing multiple modality-aware events within long form videos. We hope that our newly collected dataset and novel approach serve as a cornerstone for furthering research in the realm of LFAV understanding. Project page: https://gewu-lab.github.io/LFAV/

More From: ACM Transactions on Multimedia Computing, Communications, and Applications
