MVAFormer: RGB-Based Multi-View Spatio-Temporal Action Recognition with Transformer

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

Multi-view action recognition aims to recognize human actions from multiple camera views and to cope with occlusion caused by obstacles or crowds. In this task, cooperation among views, which produces a joint representation by combining the views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, because these methods address only the setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition (STAR) setting, in which each person’s action is recognized sequentially. To address this problem, this paper proposes MVAFormer, a multi-view action recognition method for the STAR setting. MVAFormer introduces a novel transformer-based cooperation module among views. In contrast to previous studies, which use embedding vectors that have lost spatial information, our module operates on feature maps that preserve spatial information, enabling effective cooperation in the STAR setting. Furthermore, the module divides self-attention into attention within the same view and attention across different views to model the relationships between views effectively. Experiments on a newly collected dataset demonstrate that MVAFormer outperforms comparison baselines by approximately 4.4 points in F-measure.
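The divided self-attention idea can be pictured with a small sketch. The snippet below is an illustrative PyTorch approximation, not the authors' released code: flattened feature-map tokens from all views attend within their own view and across views through two separate attention blocks, selected by boolean masks. The names (`DividedViewAttention`, `view_ids`) are assumptions made for illustration, and at least two views are assumed so every token has a valid cross-view partner.

```python
# Minimal sketch of same-view / cross-view divided self-attention (illustrative,
# not the MVAFormer implementation). Assumes at least two views.
import torch
import torch.nn as nn

class DividedViewAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.same_view_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_view_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, view_ids):
        # tokens: (B, N, C) flattened feature-map tokens from all views
        # view_ids: (N,) view index of each token; True in attn_mask blocks attention
        block_other_views = view_ids[None, :] != view_ids[:, None]
        block_same_view = ~block_other_views
        same_out, _ = self.same_view_attn(tokens, tokens, tokens,
                                          attn_mask=block_other_views)
        cross_out, _ = self.cross_view_attn(tokens, tokens, tokens,
                                            attn_mask=block_same_view)
        return self.norm(tokens + same_out + cross_out)

# Example: two views, four 64-dimensional tokens each.
tokens = torch.randn(1, 8, 64)
view_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
out = DividedViewAttention(dim=64)(tokens, view_ids)   # shape (1, 8, 64)
```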

Similar Papers
  • Research Article
  • Citations: 30
  • 10.1016/j.jvcir.2016.10.016
Spatio-temporal action localization and detection for human action recognition in big dataset
  • Oct 31, 2016
  • Journal of Visual Communication and Image Representation
  • Sameh Megrhi + 3 more

  • Conference Article
  • Citations: 3
  • 10.1117/12.2082880
Spatio-temporal action localization for human action recognition in large dataset
  • Mar 4, 2015
  • Proceedings of SPIE, the International Society for Optical Engineering
  • Sameh Megrhi + 3 more

Human action recognition has drawn much attention in the field of video analysis. In this paper, we develop a human action detection and recognition process based on the tracking of interest point (IP) trajectories. A pre-processing step that performs spatio-temporal action detection is proposed. This step uses optical flow along with dense speeded-up robust features (SURF) to detect and track moving humans in moving fields of view. The video description step is based on a fusion process that combines displacement and spatio-temporal descriptors. Experiments are carried out on the large UCF-101 dataset. Experimental results reveal that the proposed techniques achieve better performance than many existing state-of-the-art action recognition approaches.

  • Supplementary Content
  • 10.1184/r1/13198112.v1
Leveraging Context for Multi-Label Action Recognition and Detection in Video
  • Nov 5, 2020
  • Figshare
  • Joao Antunes Martins

This thesis addresses video-based multi-person, multi-label, spatiotemporal action detection and recognition. This is a challenging problem because each person can be performing several actions at the same time (e.g., talking and walking) while other actors simultaneously perform different actions. We claim that these are problems where the use of contextual information (e.g., semantic descriptions of the scene) may lead to significant performance improvements. In this work, we develop several approaches to tackle this problem and validate them on challenging datasets. We propose a framework to integrate and test multiple sources of contextual information in video-based multi-person, multi-label, spatiotemporal action detection and recognition. We highlight six contributions, collected in three publications (at different stages of publication at the time of this writing). The first contribution is a proposed Multisource Video Classification (MVC) framework that allows the combination of several sources of context information, of which we consider four types: actor-centric input filtering (a way to focus attention on the actor under analysis while still gathering appearance information from the neighborhood), semantic neighbor context (a way to inform the model of the actions performed by nearby agents), object detection (how objects interacting with the actor can inform about its action), and pose data (how high-level features extracted from the actor can help the classification process). The second contribution is a foveated approach to actor-centric filtering for input selection that weights appearance information in a decreasing way from the center to the periphery of the actor bounding box. The third contribution is a novel encoding for the semantic neighbor context and its custom classifier with spatial and temporal dependence. The fourth is a custom Hybrid Sigmoid-Softmax loss function for the multi-class/multi-label case, which combines the cross-entropy loss typical of multi-class problems with the sum-of-sigmoids loss used in the multi-label case. The fifth is the application of the developed methods to a challenging dataset with a large number of videos with multiple agents performing multiple actions, with 80 heterogeneous and highly unbalanced classes. To allow research with reasonable computing power, we have created mini-AVA, a partition of AVA that maintains temporal continuity and class distribution with only one tenth of the dataset size. The sixth contribution is a collection of ablation studies on alternative actor-centric filters and semantic neighbor context classifiers. From this research we achieve a relative mAP improvement of 18.8% using our foveated actor-centric filtering, a relative mAP improvement of 5% using our semantic neighbor context embedding and models, and a relative mAP improvement of 12.6% using our custom Hybrid Sigmoid-Softmax loss.
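The Hybrid Sigmoid-Softmax loss described in the fourth contribution can be sketched as a softmax cross-entropy term over a mutually exclusive group of classes plus a sum-of-sigmoids (binary cross-entropy) term over the remaining, co-occurring labels. The snippet below is an assumption-laden illustration of that combination; the actual class partition and any weighting used in the thesis are not reproduced here.

```python
# Hedged sketch of a hybrid sigmoid-softmax loss: softmax cross-entropy over an
# exclusive class group plus sum-of-sigmoids BCE over co-occurring labels.
# The class partition below is hypothetical, not the thesis' actual split.
import torch
import torch.nn.functional as F

def hybrid_sigmoid_softmax_loss(logits, targets, exclusive_idx, multilabel_idx):
    """logits: (B, C) raw scores; targets: (B, C) multi-hot ground truth."""
    softmax_term = F.cross_entropy(
        logits[:, exclusive_idx],
        targets[:, exclusive_idx].argmax(dim=1),   # exactly one exclusive class
    )
    sigmoid_term = F.binary_cross_entropy_with_logits(
        logits[:, multilabel_idx], targets[:, multilabel_idx]
    )
    return softmax_term + sigmoid_term

logits = torch.randn(4, 10)
targets = torch.zeros(4, 10)
targets[:, 2] = 1.0                                 # exclusive class
targets[:, 7] = 1.0                                 # one co-occurring label
loss = hybrid_sigmoid_softmax_loss(
    logits, targets,
    exclusive_idx=torch.arange(0, 5), multilabel_idx=torch.arange(5, 10),
)
```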

  • Research Article
  • 10.1016/j.clinph.2016.05.041
EPV 19. Dissociable regions for recognition and execution of conceptual and spatio-temporal action characteristics in acute stroke patients
  • Aug 4, 2016
  • Clinical Neurophysiology
  • M Martin + 7 more

  • Conference Article
  • Citations: 18
  • 10.1061/9780784412343.0064
Automated Benchmarking and Monitoring of an Earthmoving Operation's Carbon Footprint Using Video Cameras and a Greenhouse Gas Estimation Model
  • Jun 11, 2012
  • A Heydarian + 2 more

Benchmarking and monitoring are critical steps toward improving the operational efficiency of earthmoving equipment and minimizing its environmental impacts. Despite their importance, the relationship between operational efficiency and the total pollutant emissions of these operations has not been fully understood. To establish such a relationship and find ways to minimize the excessive environmental impacts caused by reduced operational efficiency, there is a need for an inexpensive and automated benchmarking and monitoring method. This paper presents a novel, cost-effective method for monitoring the carbon footprint of earthmoving operations using a vision-based equipment action recognition method along with pollutant emission inventories of construction actions. First, a site video stream is represented as a collection of spatio-temporal features by extracting space-time interest points and describing each feature with a histogram of oriented gradients. The algorithm automatically learns the probability distributions of the spatio-temporal features and action categories using multiple binary support vector machine classifiers. Next, using a new temporal sliding window model, equipment action categories are classified over a long sequence of video frames. The recognized time series of equipment actions is fed into an emission and carbon footprint estimation model, where the overall greenhouse gas emissions are analyzed based on the amount of emission for each equipment action. The proposed method is validated on several videos collected from an ongoing construction project. Preliminary results, with an average action recognition accuracy of 85%, reflect the promise that the proposed approach can help practitioners understand the operational efficiency of their construction activities and minimize excessive environmental impacts due to reduced operational efficiencies.
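A rough sense of the classification pipeline (spatio-temporal descriptors, multiple binary SVMs, temporal sliding window) can be given with scikit-learn. The snippet below is only an illustrative stand-in: the descriptors are random placeholders, and the window size and one-vs-rest setup are assumptions rather than the paper's configuration.

```python
# Illustrative stand-in for the recognition pipeline: one-vs-rest binary SVMs
# over spatio-temporal descriptors, smoothed by a temporal sliding window.
# Descriptors here are random placeholders for HOG-style features.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 96))          # 200 training descriptors
y_train = rng.integers(0, 3, size=200)        # 3 hypothetical equipment actions
clf = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)

def sliding_window_actions(frame_descriptors, window=15):
    """Per-frame prediction followed by majority vote in a centered window."""
    raw = clf.predict(frame_descriptors)
    smoothed = []
    for t in range(len(raw)):
        lo, hi = max(0, t - window // 2), min(len(raw), t + window // 2 + 1)
        labels, counts = np.unique(raw[lo:hi], return_counts=True)
        smoothed.append(labels[np.argmax(counts)])
    return np.array(smoothed)

video_descriptors = rng.normal(size=(100, 96))
print(sliding_window_actions(video_descriptors)[:10])
```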

  • Research Article
  • Citations: 2
  • 10.14569/ijacsa.2021.0120276
An Extensive Analysis of the Vision-based Deep Learning Techniques for Action Recognition
  • Jan 1, 2021
  • International Journal of Advanced Computer Science and Applications
  • Manasa R + 2 more

Action recognition involves localizing and classifying actions in a video over a sequence of frames. It can be thought of as an image classification task extended temporally. The information obtained over the multitude of frames is aggregated to produce the action classification output. Applications of action recognition systems range from assistance for healthcare systems to human-machine interaction. Action recognition has proven to be a challenging task, posing many impediments including high computation cost, capturing extended context, designing complex architectures, and a lack of benchmark datasets. Increasing the efficiency of human action recognition algorithms can significantly improve the probability of implementing them in real-world scenarios. This paper summarizes the evolution of various action localization, classification, and detection algorithms applied to data from vision-based sensors. We also review the datasets that have been used for the action classification, localization, and detection process. We further explore the areas of action classification and temporal and spatiotemporal action detection, which use convolutional neural networks, recurrent neural networks, or a combination of both.

  • Book Chapter
  • Citations: 6
  • 10.1016/b978-0-32-385787-1.00019-1
Chapter 14 - Human activity recognition
  • Jan 1, 2022
  • Deep Learning for Robot Perception and Cognition
  • Lukas Hedegaard + 2 more

  • Research Article
  • Citations: 18
  • 10.1109/access.2020.2992740
Global Spatio-Temporal Attention for Action Recognition Based on 3D Human Skeleton Data
  • Jan 1, 2020
  • IEEE Access
  • Yun Han + 4 more

Human skeleton joints captured by RGB-D cameras are widely used in action recognition for their robust and comprehensive 3D information. Presently, most skeleton-based action recognition methods treat all skeletal joints with the same importance, both spatially and temporally. However, the contributions of skeletal joints vary significantly. Hence, a GL-LSTM+Diff model is proposed to improve the recognition of human actions. A global spatial attention (GSA) model is proposed to express different weights for different skeletal joints, providing precise spatial information for human action recognition. An accumulative learning curve (ALC) model is introduced to highlight which frames contribute most to the final decision by giving varying temporal weights to each intermediate accumulated learning result. By integrating the proposed GSA (for spatial information) and ALC (for temporal processing) models into the LSTM framework and taking human skeletal joints as inputs, a global spatio-temporal action recognition framework (GL-LSTM) is constructed to recognize human actions. Diff is introduced as a preprocessing method to enhance the dynamics of the features and thus obtain distinguishable features for deep learning. Rigorous experiments on the largest dataset, NTU RGB+D, and the common small dataset, SBU, show that the proposed algorithm outperforms other state-of-the-art methods.
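The global spatial attention idea (per-joint weights feeding an LSTM) can be sketched roughly as below. This is an assumption-based PyTorch illustration, not the GL-LSTM implementation; the joint count, scoring layer, and fusion are placeholders.

```python
# Rough sketch of global spatial attention over skeleton joints before an LSTM.
# Not the GL-LSTM code; the scoring layer and dimensions are placeholders.
import torch
import torch.nn as nn

class JointAttentionLSTM(nn.Module):
    def __init__(self, num_joints=25, coord_dim=3, hidden=128, num_classes=60):
        super().__init__()
        self.score = nn.Linear(coord_dim, 1)                    # one score per joint
        self.lstm = nn.LSTM(num_joints * coord_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (B, T, J, 3) skeleton coordinate sequences
        weights = torch.softmax(self.score(x).squeeze(-1), dim=-1)  # (B, T, J)
        weighted = x * weights.unsqueeze(-1)                        # emphasize informative joints
        out, _ = self.lstm(weighted.flatten(2))                     # (B, T, hidden)
        return self.head(out[:, -1])                                # class scores

logits = JointAttentionLSTM()(torch.randn(2, 30, 25, 3))            # shape (2, 60)
```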

  • Research Article
  • Citations: 51
  • 10.1109/tcsvt.2018.2818151
Discriminative Spatio-Temporal Pattern Discovery for 3D Action Recognition
  • Apr 1, 2019
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Junwu Weng + 3 more

Despite the recent success of 3D action recognition using depth sensors, most existing works target improving action recognition performance rather than understanding how different types of actions are performed. In this paper, we propose to discover discriminative spatio-temporal patterns for 3D action recognition. Discovering these patterns can not only help to improve action recognition performance but also help us to understand and differentiate between action categories. Our proposed method takes the spatio-temporal structure of 3D actions into consideration and can discover essential spatio-temporal patterns that play key roles in action recognition. Instead of relying on an end-to-end network to learn the 3D action representation and perform classification, we represent each 3D action as a series of temporal stages composed of 3D poses. We then rely on nearest-neighbor matching and bilinear classifiers to simultaneously identify both the critical temporal stages and the spatial joints for each action class. Despite using a raw action representation and a linear classifier, experiments on five benchmark datasets show that the proposed spatio-temporal naive Bayes mutual information maximization achieves competitive performance compared with state-of-the-art methods that use sophisticated end-to-end learning, and has the advantage of finding discriminative spatio-temporal action patterns.
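The stage-based representation can be illustrated simply: split a pose sequence into a few temporal stages and score a query against class exemplars stage by stage with nearest-neighbor matching. The bilinear classifiers and mutual-information criterion from the paper are omitted; the snippet below is only a simplified sketch under those assumptions.

```python
# Simplified sketch: represent a 3D action as temporal stages of mean poses and
# score it by stage-wise nearest-neighbor distance to class exemplars. The
# paper's bilinear classifiers and NBMIM criterion are not reproduced here.
import numpy as np

def to_stages(pose_seq, num_stages=5):
    """pose_seq: (T, D) flattened joint coordinates -> (num_stages, D) stage means."""
    return np.stack([chunk.mean(axis=0) for chunk in np.array_split(pose_seq, num_stages)])

def stagewise_nn_distance(query_stages, exemplar_stage_sets):
    """Sum over stages of the distance to the closest exemplar at that stage."""
    total = 0.0
    for i, q in enumerate(query_stages):
        total += min(np.linalg.norm(q - ex[i]) for ex in exemplar_stage_sets)
    return total

rng = np.random.default_rng(0)
query = to_stages(rng.normal(size=(40, 75)))                     # 40 frames, 25 joints x 3
exemplars = [to_stages(rng.normal(size=(40, 75))) for _ in range(10)]
print(stagewise_nn_distance(query, exemplars))
```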

  • Research Article
  • Citations: 3
  • 10.3390/s25103013
YOLO-Act: Unified Spatiotemporal Detection of Human Actions Across Multi-Frame Sequences.
  • May 10, 2025
  • Sensors (Basel, Switzerland)
  • Nada Alzahrani + 2 more

Automated action recognition has become essential in the surveillance, healthcare, and multimedia retrieval industries owing to the rapid proliferation of video data. This paper introduces YOLO-Act, a novel spatiotemporal action detection model that extends the object detection capabilities of YOLOv8 to efficiently manage complex action dynamics within video sequences. YOLO-Act achieves precise and efficient action recognition by integrating keyframe extraction, action tracking, and class fusion. By adaptively selecting three keyframes representing the beginning, middle, and end of an action, the model captures essential temporal dynamics without the computational overhead of continuous frame processing. Compared with state-of-the-art approaches such as the Lagrangian Action Recognition Transformer (LART), YOLO-Act exhibits superior performance with a mean average precision (mAP) of 73.28 in experiments conducted on the AVA dataset, a gain of +28.18 mAP. Furthermore, YOLO-Act achieves this higher accuracy with significantly lower FLOPs, demonstrating its efficiency in computational resource utilization. The results highlight the advantages of incorporating precise tracking, effective spatial detection, and temporal consistency to address the challenges associated with video-based action detection.
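The keyframe idea (first, middle, and last frame of a tracked action segment) is simple enough to express directly; the index choice below is an assumption about what "beginning, middle, and end" means, not YOLO-Act's exact adaptive selection rule.

```python
# Hedged sketch of three-keyframe selection from a tracked action segment.
# The adaptive rule in YOLO-Act may differ; this uses plain index picking.
def select_keyframes(segment_frame_indices):
    """Return (start, middle, end) frame indices of an action segment."""
    if not segment_frame_indices:
        raise ValueError("empty action segment")
    start = segment_frame_indices[0]
    middle = segment_frame_indices[len(segment_frame_indices) // 2]
    end = segment_frame_indices[-1]
    return start, middle, end

print(select_keyframes(list(range(120, 180))))   # (120, 150, 179)
```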

  • Conference Article
  • Citations: 3
  • 10.22260/isarc2014/0087
Exploring Local Feature Descriptors for Construction Site Video Stabilization
  • Jul 8, 2014
  • Proceedings of the ... ISARC
  • Jung Yeol Kim + 1 more

Recent studies on automated activity analysis have adopted construction videos as an input data source to recognize and categorize construction workers’ actions. To ensure representative analysis results, these videos have to be gathered at random times and locations. In practice, such videos must be taken with hand-held cameras, which inevitably leads to jittery frames. Such frames can decrease the accuracy of automated activity analysis results. Some of the most recent and effective action recognition methods rely on spatio-temporal action recognition algorithms, and jittery frames are fatal to recognizing a worker’s actions with such algorithms. Jitter can be removed using video stabilization, which serves as a pre-processing step for action recognition in automated activity analysis. In video stabilization, local feature descriptors play a major role, and selecting the proper descriptor is critical. Therefore, the purpose of this study is to identify the best local feature descriptor for video stabilization. This paper describes the detailed steps of stabilization and analyzes the performance of various local feature descriptors for stabilizing construction site videos.
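A feature-based stabilization step of the kind the study evaluates can be sketched with OpenCV: match local descriptors between a reference frame and the current frame, estimate a homography, and warp the current frame. ORB is used here only as a concrete descriptor choice; the study compares several descriptors, and the function name and parameters are assumptions.

```python
# Hedged OpenCV sketch of descriptor-based stabilization: match features between
# frames, fit a homography with RANSAC, and warp the current frame onto the
# reference. ORB stands in for whichever descriptor performs best in the study.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def stabilize_frame(ref_gray, cur_gray):
    kp_ref, des_ref = orb.detectAndCompute(ref_gray, None)
    kp_cur, des_cur = orb.detectAndCompute(cur_gray, None)
    matches = matcher.match(des_ref, des_cur)          # assumes both frames have features
    src = np.float32([kp_cur[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_ref[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    h, w = ref_gray.shape
    return cv2.warpPerspective(cur_gray, H, (w, h))    # jitter-compensated frame
```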

  • Conference Article
  • Citations: 8
  • 10.1109/cbmi50038.2021.9461922
A Study On the Effects of Pre-processing On Spatio-temporal Action Recognition Using Spiking Neural Networks Trained with STDP
  • Jun 28, 2021
  • Mireille El-Assal + 2 more

There has been an increasing interest in spiking neural networks (SNNs) in recent years. SNNs are seen as potential solutions to the bottlenecks of artificial neural networks (ANNs) in pattern recognition, such as energy efficiency [1]. However, current methods such as ANN-to-SNN conversion and back-propagation do not take full advantage of these networks, and unsupervised methods have not yet reached success comparable to that of advanced ANNs. It is important to study the behavior of SNNs trained with unsupervised learning methods such as spike-timing-dependent plasticity (STDP) on video classification tasks, including mechanisms to model motion information using spikes, as this information is critical for video understanding. This paper presents multiple methods of transposing temporal information into a static format and then transforming the visual information into spikes using latency coding. These methods are paired with two types of temporal fusion, known as early and late fusion, and are used to help the spiking neural network capture spatio-temporal features from videos. We rely on the architecture of a convolutional spiking neural network trained with STDP and test the performance of this network on action recognition tasks. Understanding how a spiking neural network responds to different methods of movement extraction and representation can help reduce the performance gap between SNNs and ANNs. We show the effect of similarity in the shape and speed of certain actions on action recognition with spiking neural networks, and we highlight the effectiveness of some methods compared to others.
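Latency coding, mentioned above as the scheme for turning visual information into spikes, can be sketched in a few lines: brighter pixels fire earlier within a coding window. This is one common formulation; the exact scaling and time constants used in the paper are not reproduced here.

```python
# Minimal latency-coding sketch: higher pixel intensity -> earlier spike time.
# One common formulation; the paper's exact coding parameters are not reproduced.
import numpy as np

def latency_code(frame, t_max=100.0):
    """Map intensities in [0, 255] to spike times in [0, t_max];
    zero intensity produces no spike (encoded as inf)."""
    norm = frame.astype(np.float64) / 255.0
    return np.where(norm > 0, t_max * (1.0 - norm), np.inf)

frame = np.array([[0, 128, 255]], dtype=np.uint8)
print(latency_code(frame))   # [[inf, ~49.8, 0.0]]
```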

  • Conference Article
  • Citations: 1
  • 10.1109/vcip.2013.6706382
Seeing actions through scene context
  • Nov 1, 2013
  • Hong-Bo Zhang + 5 more

Human actions are not recognized in isolation; the surrounding scene provides hints. In this paper, we investigate the possibility of boosting action recognition performance by exploiting the associated scene context. To this end, we model the scene as a mid-level “hidden layer” to bridge action descriptors and action categories. This is achieved via a scene topic model, in which hybrid visual descriptors, including spatiotemporal action features and scene descriptors, are first extracted from the video sequence. We then learn a joint probability distribution between scene and action with a Naive Bayesian Nearest Neighbor algorithm, which is adopted to jointly infer the action categories online by combining off-the-shelf action recognition algorithms. We demonstrate the merits of our approach by comparison with state-of-the-art methods on several action recognition benchmarks.
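The role of the scene as a hidden layer can be illustrated with a small probabilistic fusion: marginalize a scene-conditioned action prior over the inferred scene distribution and combine it with an off-the-shelf action score. The numbers and the naive-Bayes-style product below are assumptions for illustration, not the paper's exact model.

```python
# Hedged sketch of scene-aware action scoring: combine an action classifier's
# output with a scene-conditioned action prior marginalized over the inferred
# scene distribution. All values are illustrative placeholders.
import numpy as np

def scene_aware_action_scores(action_scores, scene_posterior, action_given_scene):
    """action_scores: (A,), scene_posterior: (S,), action_given_scene: (S, A)."""
    context_prior = scene_posterior @ action_given_scene      # marginalize over scenes
    combined = action_scores * context_prior                  # naive-Bayes-style fusion
    return combined / combined.sum()

scores = scene_aware_action_scores(
    np.array([0.5, 0.3, 0.2]),                     # p(action | video) from a base model
    np.array([0.7, 0.3]),                          # p(scene | video)
    np.array([[0.6, 0.3, 0.1],                     # p(action | scene 0)
              [0.2, 0.3, 0.5]]),                   # p(action | scene 1)
)
print(scores)
```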

  • Research Article
  • Citations: 12
  • 10.1109/tcad.2023.3241113
A Highly Compressed Accelerator With Temporal Optical Flow Feature Fusion and Tensorized LSTM for Video Action Recognition on Terminal Device
  • Oct 1, 2023
  • IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
  • Peining Zhen + 4 more

Deep learning based action recognition has become ubiquitous in video analysis; however, large neural networks require enormous computation to achieve high performance, which hinders their use in mobile applications that are tightly constrained by hardware resources. In this work, we introduce a highly compact and fast neural-network-based action recognition accelerator, named ARA, for the terminal device. We build an LSTM-based spatio-temporal action recognition model from time-series features extracted from RGB frames and flow features extracted from optical flow fields. The LSTM-based spatio-temporal model is then deeply compressed with tensor decomposition to reduce redundant parameters and lessen computation overhead. On the UCF-11, UCF-101, and HMDB51 datasets, our proposed method achieves 95.87%, 94.08%, and 75.71% classification accuracy, respectively, comparable with other state-of-the-art methods. In particular, our method compresses the parameters of the LSTM model by 215× on the UCF-101 dataset. The proposed system also achieves a fast running speed of 157.7 FPS on a GPU. Furthermore, we validate the system on an ARM-based terminal device; the results show only 0.017 s latency and 4.73 W power consumption.
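The compression idea can be made concrete with a parameter-count comparison. The paper compresses the LSTM via tensor decomposition; the sketch below substitutes a plain truncated-SVD low-rank factorization of one weight matrix purely to show the kind of reduction involved, so the matrix sizes and rank are illustrative, not the paper's 215× figure.

```python
# Illustrative stand-in for weight compression: replace a large LSTM projection
# matrix with a low-rank factorization. The paper uses tensor decomposition;
# truncated SVD is used here only to show the parameter-count trade-off.
import numpy as np

def low_rank_factorize(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]          # (out, rank)
    B = Vt[:rank, :]                    # (rank, in)
    return A, B

W = np.random.randn(2048, 4096)         # hypothetical input-to-hidden weight
A, B = low_rank_factorize(W, rank=64)
print(W.size, A.size + B.size)          # 8388608 vs. 393216 parameters
```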

  • Conference Article
  • Citations: 7
  • 10.1145/3394171.3416301
Toward Accurate Person-level Action Recognition in Videos of Crowed Scenes
  • Oct 12, 2020
  • Li Yuan + 8 more

Detecting and recognizing human actions in videos with crowded scenes is a challenging problem due to the complex environment and diverse events. Prior works often fail to deal with this problem in two respects: (1) they do not utilize scene information; (2) they lack training data for crowded and complex scenes. In this paper, we focus on improving spatio-temporal action recognition by fully utilizing scene information and collecting new data. A top-down strategy is used to overcome these limitations. Specifically, we adopt a strong human detector to detect the spatial location of each person in each frame. We then apply action recognition models to learn spatio-temporal information from video frames on both the HIE dataset and new data with diverse scenes from the internet, which improves the generalization ability of our model. In addition, scene information is extracted by a semantic segmentation model to assist the process. As a result, our method achieved an average of 26.05 wf_mAP, ranking 1st in the ACM MM Grand Challenge 2020: Human in Events.
