Abstract

Recognizing actions and interactions in human activities from a first-person viewpoint is an active area of research in human action recognition (HAR). This paper presents a data-driven spatio-temporal network that combines different modalities computed from first-person videos using a temporal attention mechanism. First, our proposed approach uses a three-stream inflated 3D ConvNet (I3D) to extract low-level features from RGB frame difference (FD), optical flow (OF) and magnitude-orientation (MO) streams. An I3D network has the advantage of directly learning spatio-temporal features over short video snippets (e.g., 16 frames). Second, the extracted features are fused together and fed to a bidirectional long short-term memory (BiLSTM) network to model high-level temporal feature sequences. Third, we incorporate an attention mechanism into the BiLSTM network to automatically select the most relevant temporal snippets in a given video sequence. Finally, we conduct extensive experiments and achieve state-of-the-art results on the JPL (98.5%), NUS (84.1%), UTK (91.5%) and DogCentric (83.3%) datasets. These results show that the features extracted from the three-stream network are complementary to each other, and that the attention mechanism further improves the results by a large margin over previous approaches based on handcrafted and deep features.
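To make the fusion, BiLSTM and temporal-attention stages concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes snippet-level I3D features have already been extracted from the FD, OF and MO streams; the feature dimension (1024), hidden size, and number of classes are illustrative assumptions.

    # Minimal sketch of the fusion + BiLSTM + temporal attention stage.
    # Assumes per-stream I3D snippet features are precomputed; sizes are illustrative.
    import torch
    import torch.nn as nn

    class AttentiveBiLSTMFusion(nn.Module):
        def __init__(self, feat_dim=1024, hidden_dim=256, num_classes=7):
            super().__init__()
            # Concatenated FD + OF + MO snippet features -> 3 * feat_dim input
            self.bilstm = nn.LSTM(
                input_size=3 * feat_dim,
                hidden_size=hidden_dim,
                batch_first=True,
                bidirectional=True,
            )
            # Scores one attention weight per temporal snippet
            self.attn = nn.Linear(2 * hidden_dim, 1)
            self.classifier = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, fd, of, mo):
            # fd, of, mo: (batch, num_snippets, feat_dim) I3D features per stream
            x = torch.cat([fd, of, mo], dim=-1)          # (B, T, 3*feat_dim)
            h, _ = self.bilstm(x)                        # (B, T, 2*hidden_dim)
            scores = self.attn(h).squeeze(-1)            # (B, T)
            alpha = torch.softmax(scores, dim=1)         # attention over snippets
            context = (alpha.unsqueeze(-1) * h).sum(1)   # weighted temporal pooling
            return self.classifier(context), alpha

    # Usage example: batch of 2 videos, 16 snippets each
    model = AttentiveBiLSTMFusion()
    fd = torch.randn(2, 16, 1024)
    of = torch.randn(2, 16, 1024)
    mo = torch.randn(2, 16, 1024)
    logits, attn_weights = model(fd, of, mo)
    print(logits.shape, attn_weights.shape)  # torch.Size([2, 7]) torch.Size([2, 16])

The attention weights alpha provide a soft selection over snippets, so the classifier can emphasize the temporal segments most relevant to the action, in line with the approach described above.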
