Abstract

Action‐based video classification (or video‐based action recognition) is an active research area in computer vision. However, existing action‐based video classification approaches take only spatial and temporal components into consideration, while acoustic features (e.g. sound and speech) are neglected. In this study, the authors propose a novel approach that combines multiple cues (i.e. both visual and acoustic information) for action‐based video classification. Additionally, they introduce dense connections into their three‐stream network to address the gradient vanishing problem. Experimental results on the Kinetics Human Action Video data set and the Kinetics‐Sounds data set show that their approach can effectively improve the accuracy of action‐based video classification.
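The multi-cue combination described above can be illustrated, at a high level, as late fusion of class scores from the three streams (spatial, temporal, and acoustic). This is only a minimal sketch of the general fusion idea; the paper's actual architecture, dense connections, and fusion weights are not specified here, and all function names and logit values below are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw per-class logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_streams(spatial, temporal, acoustic, weights=(1.0, 1.0, 1.0)):
    """Weighted late fusion of per-stream class scores (illustrative only).

    Each argument is a list of logits over the same action classes,
    one list per stream; the result is a fused probability vector.
    """
    probs = [softmax(s) for s in (spatial, temporal, acoustic)]
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, probs)) / total
        for i in range(len(spatial))
    ]

# Hypothetical 4-class logits produced by each stream for one clip.
spatial_logits  = [2.0, 0.5, 0.1, -1.0]
temporal_logits = [1.5, 1.0, 0.0, -0.5]
acoustic_logits = [0.2, 2.5, 0.3, -0.2]

fused = fuse_streams(spatial_logits, temporal_logits, acoustic_logits)
prediction = max(range(len(fused)), key=fused.__getitem__)
```

In practice the acoustic stream can disambiguate visually similar actions (e.g. playing two different instruments), which is the motivation for adding it to the usual two-stream (spatial + temporal) setup.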
