HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition

Zejia Weng, Hengduo Li, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

doi:10.1145/3572776

Abstract

Videos are multimodal in nature. Conventional video recognition pipelines typically fuse multimodal features for improved performance. However, this is not only computationally expensive but also neglects the fact that different videos rely on different modalities for predictions. This article introduces Hierarchical and Conditional Modality Selection (HCMS), a simple multimodal learning framework for efficient video recognition. HCMS operates on a low-cost modality, i.e., audio cues, by default, and decides on the fly, on a per-input basis, whether to use computationally expensive modalities, including appearance and motion cues. This is achieved by three LSTMs organized hierarchically. In particular, each LSTM that operates on a high-cost modality contains a gating module, which takes lower-level features and historical information as inputs and adaptively determines whether to activate its corresponding modality; otherwise, it simply reuses historical information. We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate that the proposed approach can effectively explore multimodal information for improved classification performance while requiring much less computation.
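The conditional selection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the LSTMs are replaced by a running element-wise maximum, the learned gating module is replaced by a fixed threshold rule, and every name here (`gate`, `update`, `run_hierarchy`, `appearance_fn`, `motion_fn`, `threshold`) is hypothetical. It only shows the control flow: audio is always processed, and each costly modality is computed only when its gate fires, otherwise its previous state is reused.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def gate(lower: Vector, history: Vector, threshold: float) -> bool:
    """Hypothetical gate (assumption, not the paper's learned module).

    Activates the costly modality only when neither the lower-level
    features nor this level's own history contains a value at or above
    the threshold.
    """
    return max(max(lower), max(history)) < threshold


def update(history: Vector, features: Vector) -> Vector:
    """Stand-in for an LSTM state update: element-wise running maximum."""
    return [max(h, f) for h, f in zip(history, features)]


def run_hierarchy(
    audio: Sequence[Vector],
    appearance_fn: Callable[[int], Vector],
    motion_fn: Callable[[int], Vector],
    threshold: float = 0.6,
) -> Tuple[Vector, int, int]:
    """Process a clip step by step and count costly extractor calls.

    Returns the fused state and the number of appearance and motion
    extractions that were actually performed.
    """
    dim = len(audio[0])
    audio_state = [0.0] * dim
    app_state = [0.0] * dim
    mot_state = [0.0] * dim
    app_calls = 0
    mot_calls = 0

    for t, audio_feat in enumerate(audio):
        # Level 1: the low-cost audio modality is always used.
        audio_state = update(audio_state, audio_feat)

        # Level 2: appearance is computed only if its gate fires;
        # otherwise app_state (its history) is reused unchanged.
        if gate(audio_feat, app_state, threshold):
            app_state = update(app_state, appearance_fn(t))
            app_calls += 1

        # Level 3: the motion gate sees the lower-level information
        # (current audio features combined with the appearance state).
        lower = update(audio_feat, app_state)
        if gate(lower, mot_state, threshold):
            mot_state = update(mot_state, motion_fn(t))
            mot_calls += 1

    # Toy fusion: element-wise maximum over the three states.
    fused = update(update(audio_state, app_state), mot_state)
    return fused, app_calls, mot_calls
```

For a clip whose audio features are already strong, for example `run_hierarchy([[0.9, 0.1], [0.8, 0.2]], lambda t: [0.1, 0.7], lambda t: [0.4, 0.4])`, neither costly extractor is called. With weak audio, the appearance extractor runs first, and motion is computed only if the appearance state is also weak.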

Journal: ACM Transactions on Multimedia Computing, Communications, and Applications
Publication Date: Sep 27, 2023
Citations: 3

Similar Papers

FrameExit: Conditional Early Exiting for Efficient Video Recognition
Amir Ghodrati ... Babak Ehteshami Bejnordi
01 Jun 2021

SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition
Xing Zhang ... Zuxuan Wu
IEEE Transactions on Multimedia | VOL. 24
09 Jan 2021

A Dynamic Frame Selection Framework for Fast Video Recognition
Zuxuan Wu ... Hengduo Li
IEEE Transactions on Pattern Analysis and Machine Intelligence | VOL. 44
07 Oct 2020

AdaFrame: Adaptive Frame Selection for Fast Video Recognition
Zuxuan Wu ... Chih-Yao Ma
01 Jun 2019
