Abstract

This work presents a hybrid deep learning and Vision Transformer sequence model for the classification and identification of human motion actions in video. The model extracts spatio-temporal features from each video: a convolutional neural network (CNN) takes the video frames as input and produces a sequence of spatial feature maps. This sequence is fed temporally into a Vision Transformer (ViT), which classifies each video into one of seven classes: Jump, Walk, Wave1, Wave2, Bend, Jack, and Powerful Jump. The model was trained and tested on the Weizmann dataset, and the results show that it accurately identifies the human actions.

Keywords: Deep Learning, Vision Transformer, Human Motion Action Detection, Spatial Features, CNN.
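To illustrate the pipeline described in the abstract (per-frame CNN feature extraction followed by temporal Transformer classification), the following is a minimal sketch in PyTorch. It assumes a small CNN backbone, a standard Transformer encoder with a learnable class token, and 7 output classes; the class name, layer sizes, sequence length, and encoder depth are illustrative assumptions and do not reflect the paper's actual architecture or hyperparameters.

import torch
import torch.nn as nn

class CNNViTActionClassifier(nn.Module):
    """Hypothetical sketch: per-frame CNN features -> Transformer encoder -> class logits."""

    def __init__(self, num_classes=7, feat_dim=256, num_frames=32):
        super().__init__()
        # Small CNN backbone that maps each frame to a feature vector.
        # The layer sizes here are illustrative, not the paper's configuration.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Learnable [CLS] token and positional embeddings over the frame sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames + 1, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, dim_feedforward=4 * feat_dim, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, video):
        # video: (batch, frames, channels, height, width)
        b, t, c, h, w = video.shape
        frames = video.reshape(b * t, c, h, w)
        feats = self.cnn(frames).reshape(b, t, -1)   # (b, t, feat_dim)
        cls = self.cls_token.expand(b, -1, -1)       # (b, 1, feat_dim)
        x = torch.cat([cls, feats], dim=1) + self.pos_embed[:, : t + 1]
        x = self.transformer(x)
        return self.head(x[:, 0])                    # classify from the [CLS] token

# Example usage: a batch of 2 clips, 32 RGB frames each, 64x64 pixels.
model = CNNViTActionClassifier(num_classes=7)
logits = model(torch.randn(2, 32, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 7])

The key design point shown here is that the CNN is applied independently to each frame to obtain spatial features, while temporal reasoning is left entirely to the Transformer encoder over the frame sequence.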
