Abstract

Action‐based video classification (or video‐based action recognition) is an active research area in computer vision. However, existing action‐based video classification approaches take only spatial and temporal components into consideration, while acoustic features (e.g. sound and speech) are neglected. In this study, the authors propose a novel approach that combines multiple cues (i.e. both visual and acoustic information) for action‐based video classification. Additionally, they introduce dense connections into their three‐stream network to address the gradient vanishing problem. Experimental results on the Kinetics Human Action Video data set and the Kinetics‐Sounds data set show that their approach can effectively improve the accuracy of action‐based video classification.
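The multi-cue combination described above can be illustrated, at a high level, as late fusion of class scores from the three streams (spatial, temporal, and acoustic). This is only a minimal sketch of the general fusion idea; the paper's actual architecture, dense connections, and fusion weights are not specified here, and all function names and logit values below are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw per-class logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_streams(spatial, temporal, acoustic, weights=(1.0, 1.0, 1.0)):
    """Weighted late fusion of per-stream class scores (illustrative only).

    Each argument is a list of logits over the same action classes,
    one list per stream; the result is a fused probability vector.
    """
    probs = [softmax(s) for s in (spatial, temporal, acoustic)]
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, probs)) / total
        for i in range(len(spatial))
    ]

# Hypothetical 4-class logits produced by each stream for one clip.
spatial_logits  = [2.0, 0.5, 0.1, -1.0]
temporal_logits = [1.5, 1.0, 0.0, -0.5]
acoustic_logits = [0.2, 2.5, 0.3, -0.2]

fused = fuse_streams(spatial_logits, temporal_logits, acoustic_logits)
prediction = max(range(len(fused)), key=fused.__getitem__)
```

In practice the acoustic stream can disambiguate visually similar actions (e.g. playing two different instruments), which is the motivation for adding it to the usual two-stream (spatial + temporal) setup.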
