Abstract

Human action segmentation and recognition from the continuous untrimmed sensor data stream is a challenging issue known as temporal action detection. This article provides a two-stream You Only Look Once-based network method, which fuses video and skeleton streams captured by a Kinect sensor, and our data encoding method is used to turn the spatiotemporal temporal action detection into a one-dimensional object detection problem in constantly augmented feature space. The proposed approach extracts spatial–temporal three-dimensional convolutional neural network features from video stream and view-invariant features from skeleton stream, respectively. Furthermore, these two streams are encoded into three-dimensional feature spaces, which are represented as red, green, and blue images for subsequent network input. We proposed the two-stream You Only Look Once-based networks which are capable of fusing video and skeleton information by using the processing pipeline to provide two fusion strategies, boxes-fusion or layers-fusion. We test the temporal action detection performance of two-stream You Only Look Once network based on our data set High-Speed Interplanetary Tug/Cocoon Vehicles-v1, which contains seven activities in the home environment and achieve a particularly high mean average precision. We also test our model on the public data set PKU-MMD that contains 51 activities, and our method also has a good performance on this data set. To prove that our method can work efficiently on robots, we transplanted it to the robotic platform and an online fall down detection experiment.

Highlights

  • In modern society, more and more people spend their elderly life living alone

  • We propose a novel idea that innovatively transforms the temporal action detection (TAD) problem into a 1D object detection problem based on a two-stream YOLO-based deep learning network that fuses the encoded video and skeleton stream of Kinect sensor

  • We proposed a TAD deep learning network, which incorporates video and skeleton information from Kinect sensor

Read more

Summary

Introduction

More and more people spend their elderly life living alone. These phenomena may cause a potential risk due to the inadequate ability of the elder people to take good care of themselves. In skeleton-stream, we apply view-invariant transform on skeleton sequence to extract view-invariant feature (VIF), which may eliminate view change influence and retain more relative motions than some traditional skeleton-based methods.[18] The skeleton VIF sequence is encoded into color map whose width indicates the time line and height is equal to the number of skeletons, preserving the temporal information and the spatial information of original action. YOLO-based network that provides two information fusion strategies, boxes-fusion and layers-fusion, for TAD of color maps encoded from two input streams. We propose a novel idea that innovatively transforms the TAD problem into a 1D object detection problem based on a two-stream YOLO-based deep learning network that fuses the encoded video and skeleton stream of Kinect sensor.

Related works
Experiments
Methods
Conclusion
Findings
Declaration of conflicting interests

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.