Abstract
Inspired by the promising performance achieved by recurrent neural networks (RNNs) and convolutional neural networks (CNNs) in skeleton-based action recognition, this paper presents a deep network architecture that combines a CNN for classification with an RNN-based attention mechanism for human interaction recognition. Specifically, the attention module assigns different levels of attention to different frames through learned weights, while the CNN extracts high-level spatial and temporal information from the skeleton data. These two modules seamlessly form a single network architecture. In addition, to eliminate the impact of different body locations and orientations, the skeleton coordinates are transformed from the original coordinate system to a human-centric coordinate system. Furthermore, three different features are extracted from the skeleton data and fed to three subnetworks, respectively; these subnetworks are then fused into an integrated network. Experimental results demonstrate the validity of the proposed approach on two widely used human interaction datasets.
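As a rough illustration of the frame-level attention described above, the sketch below weights per-frame features by normalized relevance scores so that more informative frames dominate the sequence summary. This is a minimal NumPy sketch only: the function names, the scoring vector `w`, and the simple dot-product score are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frame_features, w):
    """Weight per-frame features by attention scores.

    frame_features: (T, D) array, one D-dim feature per frame
        (e.g., a recurrent hidden state per skeleton frame).
    w: (D,) scoring vector; in a real model this would be learned.
    Returns a (D,) attention-weighted summary of the sequence.
    """
    scores = frame_features @ w      # (T,) one relevance score per frame
    alpha = softmax(scores)          # weights sum to 1 across frames
    return alpha @ frame_features    # weighted combination of frames

# Toy usage with stand-in data (shapes chosen arbitrarily):
T, D = 30, 64
feats = np.random.randn(T, D)        # stand-in for per-frame RNN states
w = np.random.randn(D)               # stand-in for a learned scoring vector
summary = attention_pool(feats, w)   # heavily-weighted frames dominate
```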
Highlights
Human action recognition and interaction recognition have recently attracted intensive attention from researchers in the computer vision field due to their extensive application prospects, such as intelligent surveillance and human-machine interaction
The proposed approach is evaluated on two human interaction datasets captured by Kinect: the SBU Interaction Dataset [9] and the NTU RGB+D Dataset [10]
The human-centric coordinate system eliminates the influence of different viewing perspectives on actions, which verifies the effectiveness of the coordinate transformation (see the sketch below)
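A minimal sketch of such a human-centric normalization follows, assuming (hypothetically) that a hip-center joint serves as the origin and the shoulder line defines the x-axis; the joint indices and reference joints are illustrative, since the exact skeleton layout depends on the dataset.

```python
import numpy as np

def to_human_centric(skeleton, hip=0, l_shoulder=4, r_shoulder=8):
    """Translate and rotate one skeleton frame into a human-centric frame.

    skeleton: (J, 3) array of joint coordinates in camera space.
    hip, l_shoulder, r_shoulder: hypothetical joint indices; the actual
    indices depend on the Kinect skeleton layout of the dataset used.
    """
    # Translation: place the hip-center joint at the origin,
    # removing the effect of where the subject stands.
    centered = skeleton - skeleton[hip]

    # Rotation about the vertical (y) axis so the shoulder line becomes
    # parallel to the x-axis, removing view-orientation variance.
    d = centered[r_shoulder] - centered[l_shoulder]
    theta = np.arctan2(d[2], d[0])   # shoulder-line angle in the x-z plane
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[  c, 0.0,   s],
                      [0.0, 1.0, 0.0],
                      [ -s, 0.0,   c]])
    return centered @ rot_y.T

# For a sequence, the same transform would typically be applied per frame
# (or fixed from the first frame) before feature extraction.
```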
Summary
Human action recognition and interaction recognition have recently attracted intensive attention from researchers in the computer vision field due to their extensive application prospects, such as intelligent surveillance and human-machine interaction. Most previous methods are devoted to human action recognition in two-dimensional RGB data [1,2]. Because RGB data are highly sensitive to environmental variability, precise action recognition is a challenging task. In [3], Rezazadegan et al. proposed an action-region proposal method that, informed by optical flow, extracts image regions likely to contain actions and thus eliminates the influence of the background. The sensitivity to appearance and environment can also be overcome by using cost-efficient RGB-D (i.e., color plus depth) sensors [5]