Abstract

Appearance- and depth-based action recognition has been researched extensively to improve recognition accuracy by exploiting motion and shape cues recovered from RGB-D video data. Convolutional neural networks (CNNs) have shown superior performance on action classification problems with spatial and apparent motion inputs. The current generation of CNNs uses spatial RGB videos and depth maps to recognize action classes from RGB-D video. In this work, we propose a 4-stream CNN architecture that has two spatial RGB-D video data streams and two apparent motion streams, with inputs extracted from the optical flow of RGB-D videos. Each CNN stream has 8 convolutional layers, 2 dense layers, and one SoftMax layer, and a score fusion model merges the scores from the four streams. The performance of the proposed 4-stream action recognition framework is tested on our own action dataset and three benchmark datasets for action recognition. The usefulness of the proposed model is evaluated against state-of-the-art CNN architectures for action recognition.
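
The score fusion step can be illustrated with a minimal sketch. The abstract does not state which fusion rule is used, so the weighted averaging of per-stream SoftMax scores below is an assumption, and the names `softmax`, `fuse_scores`, and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert one stream's raw class scores into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_logits, weights=None):
    """Fuse per-stream class scores by weighted averaging.

    Assumption: averaging SoftMax outputs is one common late-fusion rule;
    it is not confirmed as the rule used in the paper.
    """
    if weights is None:
        weights = [1.0] * len(stream_logits)  # equal weight per stream
    total_weight = sum(weights)
    probs = [softmax(logits) for logits in stream_logits]
    num_classes = len(probs[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, probs)) / total_weight
        for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=fused.__getitem__)
    return fused, predicted


# Hypothetical logits for 3 action classes from the four streams:
# RGB spatial, depth spatial, RGB optical flow, depth optical flow.
example_logits = [
    [2.0, 0.5, 0.1],
    [1.5, 1.0, 0.2],
    [0.3, 2.2, 0.1],
    [1.8, 0.4, 0.6],
]
fused_scores, predicted_class = fuse_scores(example_logits)
print(fused_scores, predicted_class)
```

Passing a `weights` list would let a stream that is more reliable on a given dataset contribute more to the final decision; with no weights, all four streams count equally.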
