Abstract

In action recognition research, the two primary types of information are the appearance and motion cues learned from RGB images captured by visual sensors. However, depending on the characteristics of the action, contextual information, such as the presence of specific objects or globally shared information in the image, becomes vital to defining the action. For example, the presence of a ball is the key evidence distinguishing "kicking" from "running". Furthermore, some actions share typical global abstract poses, which can serve as a key to classifying actions. Based on these observations, we propose a multi-stream network model that incorporates spatial, temporal, and contextual cues in the image for action recognition. We evaluated the proposed method with C3D or inflated 3D ConvNet (I3D) as the backbone network on two different action recognition datasets. The results show an overall improvement in accuracy, demonstrating the effectiveness of the proposed method.
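The Methods section details the exact architecture; as a rough, hypothetical illustration of the idea, the following PyTorch sketch fuses four cue-specific streams (RGB, optical flow, pose, pairwise object context) by averaging their class logits. The tiny 3D-ConvNet stand-ins, the channel counts, and the averaging fusion are our assumptions, not the paper's verified C3D/I3D configuration.

    # Minimal late-fusion sketch of a multi-stream action recognizer (PyTorch).
    # The stream backbones and fusion rule are illustrative assumptions.
    import torch
    import torch.nn as nn

    class TinyStream(nn.Module):
        """Stand-in for a C3D/I3D backbone: 3D conv -> global pool -> logits."""
        def __init__(self, in_channels, num_classes):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
            )
            self.fc = nn.Linear(32, num_classes)

        def forward(self, x):              # x: (B, C, T, H, W)
            return self.fc(self.features(x).flatten(1))

    class MultiStreamNet(nn.Module):
        def __init__(self, num_classes=101):
            super().__init__()
            # One stream per cue: RGB (3 ch), flow (2 ch: u, v),
            # pose heatmap (1 ch), pairwise object map (1 ch).
            self.rgb = TinyStream(3, num_classes)
            self.flow = TinyStream(2, num_classes)
            self.pose = TinyStream(1, num_classes)
            self.pair = TinyStream(1, num_classes)

        def forward(self, rgb, flow, pose, pair):
            # Late fusion: average the per-stream class logits.
            return (self.rgb(rgb) + self.flow(flow)
                    + self.pose(pose) + self.pair(pair)) / 4.0

    model = MultiStreamNet()
    scores = model(torch.randn(2, 3, 16, 112, 112),   # RGB clip
                   torch.randn(2, 2, 16, 112, 112),   # optical flow
                   torch.randn(2, 1, 16, 112, 112),   # pose input
                   torch.randn(2, 1, 16, 112, 112))   # pairwise input
    print(scores.shape)                               # torch.Size([2, 101])

Averaging logits is only one common late-fusion choice; weighted or learned fusion are equally plausible alternatives.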

Highlights

  • With the recent advent of deep learning, visual sensor-based action recognition technologies are being actively researched and used in a wide range of applications, e.g., person activity analysis [1], event detection [2], and video surveillance systems [3].

  • Taking advantage of Mask-RCNN, we compare the overall accuracy of using the bounding box versus the mask as the input to the pairwise stream in Table 1, with the inflated 3D ConvNet (I3D) backbone (see the sketch after this list).

  • In multi-stream fusion, the two-stream RGB and flow baseline achieved 97.33%, while adding the pose stream increased the accuracy by 0.56% and adding the pair stream increased it by 0.13%.
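As a hypothetical sketch of how the two input variants compared in Table 1 could be produced, the code below runs torchvision's Mask R-CNN on a frame and derives both a rasterized bounding-box map and a binary segmentation-mask map. The score threshold, the union over instances, and all preprocessing details are assumptions, not the paper's pipeline.

    # Hypothetical sketch of the two pairwise-stream input variants
    # (bounding box vs. mask) using torchvision's Mask R-CNN.
    import torch
    import torchvision

    # Requires torchvision >= 0.13 for the `weights` argument.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    frame = torch.rand(3, 480, 640)          # one RGB frame scaled to [0, 1]
    with torch.no_grad():
        pred = model([frame])[0]             # dict: boxes, labels, scores, masks

    keep = pred["scores"] > 0.5              # drop low-confidence detections
    boxes = pred["boxes"][keep]              # (N, 4) boxes as (x1, y1, x2, y2)
    masks = pred["masks"][keep] > 0.5        # (N, 1, H, W) binary instance masks

    # Box variant: rasterize each box into a binary map of the frame size.
    box_map = torch.zeros(frame.shape[1:])
    for x1, y1, x2, y2 in boxes.round().long():
        box_map[y1:y2, x1:x2] = 1.0

    # Mask variant: union of the instance masks, same shape as box_map.
    mask_map = masks.any(dim=0).squeeze(0).float()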



Introduction

With the recent advent of deep learning, visual sensor-based action recognition technologies are being actively researched and used in a wide range of applications, e.g., person activity analysis [1], event detection [2], and video surveillance systems [3]. A key challenge is to develop a way to effectively incorporate and utilize both the temporal and the spatial information of video clips. These two types of information, appearance and optical flow, are fully utilized by adopting the two-stream networks [4], and several previous studies [5,6,7,8] have likewise addressed this problem.
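As a minimal, self-contained illustration of the two-stream idea (not the paper's exact pipeline), the sketch below prepares the two inputs such a network consumes: stacked RGB frames for the appearance stream, and dense optical flow, computed here with OpenCV's Farneback estimator, for the motion stream. The file name and the flow parameters are placeholder assumptions.

    # Illustrative two-stream input preparation (not the paper's pipeline):
    # RGB frames feed the appearance stream; dense optical flow feeds
    # the motion stream.
    import cv2
    import numpy as np

    cap = cv2.VideoCapture("video.mp4")      # hypothetical input clip
    ok, prev = cap.read()
    assert ok, "could not read the video"
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    rgb_frames, flow_frames = [prev], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense flow: (H, W, 2) array of per-pixel (u, v) displacements.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        rgb_frames.append(frame)
        flow_frames.append(flow)
        prev_gray = gray
    cap.release()

    rgb_input = np.stack(rgb_frames)         # appearance stream: (T, H, W, 3)
    flow_input = np.stack(flow_frames)       # motion stream: (T-1, H, W, 2)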


