Recognition Benchmark Research Articles

Video Action Recognition (ViAR) aims to identify the category of the human action observed in a given video. With the advent of Deep Learning (DL) techniques, notable performance breakthroughs have been achieved in this field. However, the success of most existing DL-based ViAR methods relies heavily on a large amount of annotated data, i.e., videos with corresponding action categories. In practice, obtaining enough annotations is often difficult because labeling is expensive, and these methods can degrade significantly when annotations are scarce. To address this issue, we propose an end-to-end semi-supervised Differentiated Auxiliary guided Network (DANet) that makes the best use of a few annotated videos. In addition to the usual supervised learning on the few annotated videos, DANet draws on the knowledge of multiple pre-trained auxiliary networks to optimize the ViAR network in a self-supervised way on the same videos with their annotations removed. Given the tight connection between video action recognition and classical static image-based visual tasks, the abundant knowledge in pre-trained static image-based models can be used to train the ViAR model. Specifically, DANet is a two-branch architecture: a target branch containing the ViAR network, and an auxiliary branch containing multiple auxiliary networks (i.e., diverse off-the-shelf models for relevant image tasks). Given a limited number of annotated videos, we train the target ViAR network end-to-end in a semi-supervised way, namely, with both the supervised cross-entropy loss on the annotated videos and per-auxiliary weighted self-supervised contrastive losses on the same videos without using their annotations. We further explore different weightings of the guidance from the auxiliary networks to the ViAR network, to better reflect the different relationships between the image-based models and the ViAR model. Finally, we conduct extensive experiments on several popular action recognition benchmarks in comparison with existing state-of-the-art methods, and the experimental results demonstrate the superiority of DANet over most of the compared methods. In particular, DANet clearly surpasses state-of-the-art ViAR methods even with far fewer annotated videos.
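The objective described in this abstract (a supervised cross-entropy term plus per-auxiliary weighted contrastive terms) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the InfoNCE form of the contrastive loss, the cosine similarity, the temperature value, and all function names are assumptions introduced here, since the abstract does not specify them.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(target_feats, aux_feats, temperature=0.1):
    """Assumed contrastive loss: the i-th target feature should match the
    i-th auxiliary feature and differ from the other videos in the batch."""
    total = 0.0
    for i, t in enumerate(target_feats):
        sims = [cosine(t, a) / temperature for a in aux_feats]
        total += cross_entropy(sims, i)
    return total / len(target_feats)

def semi_supervised_loss(logits, labels, target_feats, aux_feats_per_net, weights):
    """Supervised cross-entropy plus one weighted contrastive term per
    auxiliary network (the weights are hypothetical hyperparameters)."""
    sup = sum(cross_entropy(z, y) for z, y in zip(logits, labels)) / len(labels)
    unsup = sum(w * info_nce(target_feats, aux)
                for w, aux in zip(weights, aux_feats_per_net))
    return sup + unsup
```

In this sketch, setting every weight to zero reduces the objective to plain supervised training, while a larger weight lets the corresponding auxiliary network guide the target features more strongly.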

Spatio-temporal pattern recognition is a fundamental ability of the brain which is required for numerous real-world activities. Recent deep learning approaches have reached outstanding accuracies in such tasks, but their implementation on conventional embedded solutions is still computationally expensive and energy-hungry. Tactile sensing in robotic applications is a representative example where real-time processing and energy efficiency are required. Following a brain-inspired computing approach, we propose a new benchmark for spatio-temporal tactile pattern recognition at the edge through Braille letter reading. We recorded a new Braille letters dataset based on the capacitive tactile sensors of the iCub robot's fingertip. We then investigated the importance of spatial and temporal information as well as the impact of event-based encoding on spike-based computation. Afterward, we trained and compared feedforward and recurrent Spiking Neural Networks (SNNs) offline using Backpropagation Through Time (BPTT) with surrogate gradients, then deployed them on the Intel Loihi neuromorphic chip for fast and efficient inference. We compared our approach to standard classifiers, in particular to a Long Short-Term Memory (LSTM) network deployed on the embedded NVIDIA Jetson GPU, in terms of classification accuracy, power and energy consumption, and computational delay. Our results show that the LSTM reaches an accuracy of ~97%, outperforming the recurrent SNN by ~17% when using continuous frame-based data instead of event-based inputs. However, the recurrent SNN on Loihi with event-based inputs is ~500 times more energy-efficient than the LSTM on Jetson, requiring a total power of only ~30 mW. This work proposes a new benchmark for tactile sensing and highlights the challenges and opportunities of event-based encoding, neuromorphic hardware, and spike-based computing for spatio-temporal pattern recognition at the edge.
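To make the idea of event-based encoding concrete, the following is a minimal sketch of one common scheme, threshold-crossing (delta) encoding, which turns a sampled sensor trace into ON/OFF events. The abstract does not state which encoder the authors used, so the scheme, the function name `delta_encode`, and the threshold value are assumptions made purely for illustration.

```python
def delta_encode(samples, threshold):
    """Emit (sample_index, polarity) events each time the signal moves by
    `threshold` away from the last reference level (assumed scheme)."""
    events = []
    ref = samples[0]  # reference level starts at the first sample
    for i, x in enumerate(samples[1:], start=1):
        # Upward change: one ON event per threshold step crossed.
        while x - ref >= threshold:
            ref += threshold
            events.append((i, +1))
        # Downward change: one OFF event per threshold step crossed.
        while ref - x >= threshold:
            ref -= threshold
            events.append((i, -1))
    return events

# A hypothetical taxel reading that rises and then falls back.
trace = [0, 0, 2, 2, 1]
print(delta_encode(trace, threshold=1))
```

A constant trace produces no events at all, which is one reason such encodings pair naturally with spike-based hardware: computation is only triggered when the input changes.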

Articles published on Recognition Benchmark

Robust transformer with locality inductive bias and feature normalization

A novel hybrid machine learning approach for traffic sign detection using CNN-GRNN

Egypt Monuments Dataset version 1: A Scalable Benchmark for Image Classification and Monument Recognition

MAR20: A benchmark for military aircraft recognition in remote sensing images

Transductive Prototypical Attention Reasoning Network for Few-Shot SAR Target Recognition

Attribute-Guided Generative Adversarial Network With Improved Episode Training Strategy for Few-Shot SAR Image Generation

Contrastive Feature Disentangling for Partial Aspect Angles SAR Noncooperative Target Recognition

Joint Holistic and Masked Face Recognition

Knowledge-aware Global Reasoning for Situation Recognition.

Collaborative Multilingual Continuous Sign Language Recognition: A Unified Framework

Wave interference network with a wave function for traffic sign recognition.

Robust tactile object recognition in open-set scenarios using Gaussian prototype learning.

Efficient Transformer-Based Compressed Video Modeling via Informative Patch Selection.

SeVuc: A study on the Security Vulnerabilities of Capsule Networks against adversarial attacks

Lightweight Semantic-Guided Neural Networks Based on Single Head Attention for Action Recognition

Classification by Principal Component Regression in the Real and Hypercomplex Domains

Traffic sign classification using CNN and detection using faster-RCNN and YOLOV4

Multi-Task Learning for Scene Text Image Super-Resolution with Multiple Transformers

DANet: Semi-supervised differentiated auxiliaries guided network for video action recognition

Braille letter reading: A benchmark for spatio-temporal pattern recognition on neuromorphic hardware.
