Abstract

Current action recognition studies benefit from two neural network branches, spatial and temporal. This work extends previous work by fusing the spatial and temporal branches to provide superior action recognition capability on multi-label, multi-class classification problems. In this paper, we propose three fusion models with different fusion strategies. We first build several efficient temporal Gaussian mixture (TGM) layers to form spatial and temporal branches that learn a set of features. In addition to these branches, we introduce a new deep spatio-temporal branch, consisting of a series of TGM layers, that learns from the features produced by the existing branches. Each branch produces a temporal-aware feature that helps the model understand the underlying action in a video. To verify the performance of the proposed models, we performed extensive experiments on the well-known MultiTHUMOS benchmark dataset. The results demonstrate the importance of the proposed deep fusion mechanism, which contributes to the overall score while keeping the number of parameters small.
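
Since the TGM layer is central to all three fusion models, a minimal PyTorch-style sketch of what such a layer computes is given below: temporal kernels built as soft mixtures of learned Gaussians, which keeps the parameter count far smaller than a free-form temporal convolution. This is an illustrative reconstruction under stated assumptions, not the authors' code; all names and hyperparameters (`TGMLayer`, `num_gaussians`, `kernel_len`) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TGMLayer(nn.Module):
    """Illustrative temporal Gaussian mixture (TGM) layer.

    Each temporal kernel is a soft mixture of M Gaussians whose centers
    and widths are learned, so only O(M + out_ch * M) parameters are
    needed instead of out_ch * in_ch * kernel_len free weights.
    """

    def __init__(self, in_ch, out_ch, num_gaussians=8, kernel_len=15):
        super().__init__()
        self.in_ch, self.out_ch, self.kernel_len = in_ch, out_ch, kernel_len
        # Learned center (mu) and width (sigma) of each Gaussian bump.
        self.mu = nn.Parameter(torch.randn(num_gaussians))
        self.sigma = nn.Parameter(torch.ones(num_gaussians))
        # Soft-attention weights mixing Gaussians into per-channel kernels.
        self.mix = nn.Parameter(torch.randn(out_ch, num_gaussians))

    def _kernels(self):
        t = torch.arange(self.kernel_len, dtype=torch.float32,
                         device=self.mu.device)
        # Map raw parameters to valid centers in [0, L-1] and positive widths.
        center = (self.kernel_len - 1) / 2.0 * (torch.tanh(self.mu) + 1.0)
        width = F.softplus(self.sigma) + 1e-4
        g = torch.exp(-0.5 * ((t[None, :] - center[:, None]) / width[:, None]) ** 2)
        g = g / g.sum(dim=1, keepdim=True)          # (M, L), rows sum to 1
        return F.softmax(self.mix, dim=1) @ g       # (out_ch, L)

    def forward(self, x):
        # x: (batch, in_ch, time) per-frame feature sequences.
        k = self._kernels()[:, None, :].repeat(1, self.in_ch, 1)
        return F.conv1d(x, k, padding=self.kernel_len // 2)
```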

Highlights

  • Action recognition is currently a topic of active research due to the challenges that researchers must overcome in this field and due to the importance of action recognition applications in our daily lives

  • We propose a deep spatio-temporal branch comprising several temporal Gaussian mixture (TGM) layers to boost the accuracy in estimating the multi-label, multi-class problem for a given action video

  • Several studies have investigated the use of a recurrent network, such as long short-term memory (LSTM) or bi-directional long short-term memory (Bi-LSTM), to classify an action, as described in [28]–[31]. These authors argued that implementing a recurrent network on top of a convolutional neural network (CNN) backbone enables capturing the sequential information of a video (a minimal sketch of such a baseline appears below)
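
For concreteness, the sketch below shows one common form of that recurrent baseline, assuming a ResNet-18 backbone and a Bi-LSTM head; the backbone, hidden size, and class names here are illustrative assumptions, not the configurations used in [28]–[31].

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNBiLSTM(nn.Module):
    """Sketch of a recurrent action classifier: per-frame CNN features
    are fed to a Bi-LSTM so the model can use the frames' temporal order.
    ResNet-18 and hidden=512 are illustrative choices only."""

    def __init__(self, num_classes, hidden=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()   # keep the 512-d pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip):
        # clip: (batch, time, 3, H, W) RGB frames.
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.lstm(feats)     # (batch, time, 2 * hidden)
        return self.head(seq)         # per-frame class logits
```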

Summary

INTRODUCTION

Action recognition is currently a topic of active research due to the challenges that researchers must overcome in this field and due to the importance of action recognition applications in our daily lives. One of the challenges in classifying an action in a video is to accurately capture the hidden temporal relation between frames in addition to capturing the spatial semantics. To take both aspects into account, the filters of the convolutional layer must operate on 3-dimensional (3D) data, in contrast with the more common approach for image classification, which uses only 2-dimensional (2D) convolution. A different approach has been proposed that uses optical flow fields as an additional input modality while keeping the RGB images as the main input. This modality focuses only on pixel displacement over time, thereby removing unnecessary spatial information contained in the frames. We propose a deep spatio-temporal branch comprising several TGM layers to boost the accuracy in estimating the multi-label, multi-class problem for a given action video; a sketch of this fusion idea appears below.
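
Building on the `TGMLayer` sketch above, the following is a rough illustration of how spatial (RGB) and temporal (optical-flow) branches could feed a deeper spatio-temporal branch over their concatenated features. The paper proposes three distinct fusion strategies that are not reproduced here; branch depths, dimensions, and names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ThreeBranchFusionSketch(nn.Module):
    """Illustrative fusion of spatial, temporal, and deep spatio-temporal
    branches, each built from the TGMLayer sketch defined earlier."""

    def __init__(self, feat_dim, hidden, num_classes):
        super().__init__()
        self.spatial = TGMLayer(feat_dim, hidden)    # over per-frame RGB features
        self.temporal = TGMLayer(feat_dim, hidden)   # over optical-flow features
        self.spatio_temporal = nn.Sequential(        # deep fusion branch
            TGMLayer(2 * hidden, hidden),
            TGMLayer(hidden, hidden),
        )
        self.classifier = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, rgb_feats, flow_feats):
        # Both inputs: (batch, feat_dim, time) per-frame feature sequences.
        s = self.spatial(rgb_feats)
        t = self.temporal(flow_feats)
        fused = self.spatio_temporal(torch.cat([s, t], dim=1))
        return self.classifier(fused)                # (batch, num_classes, time)
```

For a multi-label benchmark such as MultiTHUMOS, each per-frame class logit would typically be trained with an independent sigmoid and binary cross-entropy, since several actions can co-occur in the same frame.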

RELATED WORK
OUR PROPOSED FUSION MODELS
EXPERIMENT
Findings
CONCLUSION