Abstract

In socially assistive robotics, human activity recognition plays a central role whenever the robot's behavior must adapt to the human's. In this paper, we present an activity recognition approach for activities of daily living based on deep learning and skeleton data. In the literature, ad hoc feature extraction/selection algorithms combined with supervised classification methods have been deployed, reaching excellent classification performance. Here, we propose a deep learning approach combining a CNN and an LSTM: it learns both the spatial dependencies correlating the limbs in a 3D grid representation of the skeleton and the temporal dependencies across instances with a periodic pattern, working on raw data and thus without requiring an explicit feature extraction process. The model is proposed for real-time activity recognition and is tested on the CAD-60 dataset. Results show that the proposed model outperforms a plain LSTM model thanks to the automatic extraction of features capturing the limbs' correlations. In the “New Person” setting, the CNN-LSTM model achieves 95.4% precision and 94.4% recall, while in the “Have Seen” setting it reaches 96.1% precision and 94.7% recall.
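
To make the architecture concrete, the following is a minimal PyTorch sketch of such a CNN-LSTM classifier. The layer widths, grid shape, and class count are illustrative assumptions, not the settings reported in the paper.

```python
# Minimal sketch of a CNN-LSTM skeleton-activity classifier (PyTorch).
# The grid shape, layer sizes, and class count below are assumptions
# for illustration, not the paper's exact configuration.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes: int = 12, hidden_size: int = 128):
        super().__init__()
        # Per-frame CNN: learns spatial dependencies between limbs laid
        # out in a (3-channel, H, W) grid of joint coordinates.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 64)
        )
        # LSTM: learns temporal dependencies across the frame sequence.
        self.lstm = nn.LSTM(64, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, 3, H, W) sequence of per-frame skeleton grids
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])  # logits over activity classes
```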

Highlights

  • Personal service robotics applications are already available on the market for use in human-populated environments such as workplaces, public spaces, and homes

  • We investigate training the recognition module on both the spatial dependencies arising from the relationships among the RGB-D skeleton joints, via a convolutional neural network (CNN), and the temporal patterns of the activities, via a long short-term memory network (LSTM)

  • Unlike the approaches applied to the CAD-60 dataset, which select and extract features manually, we propose a deep learning model for automatic feature extraction that uses CNNs to capture spatial dependencies within human poses and LSTMs to capture temporal dependencies between poses (a sketch of the grid representation follows this list)
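
As a rough illustration of that grid representation, the sketch below packs one frame of Kinect-style skeleton joints into an image-like tensor a CNN can convolve over. The joint ordering is an assumption chosen so that neighbouring cells belong to the same limb; the paper's actual layout may differ.

```python
# Hypothetical packing of one frame of Kinect skeleton joints into an
# image-like grid; the joint ordering is an assumed layout that keeps
# each limb's joints in adjacent cells.
import numpy as np

JOINT_ORDER = [
    "head", "neck", "torso",
    "left_shoulder", "left_elbow", "left_hand",
    "right_shoulder", "right_elbow", "right_hand",
    "left_hip", "left_knee", "left_foot",
    "right_hip", "right_knee", "right_foot",
]

def frame_to_grid(joints: dict) -> np.ndarray:
    """Map a {joint_name: (x, y, z)} frame to a (3, 5, 3) array:
    3 coordinate channels, 5 rows (body parts), 3 joints per row."""
    coords = np.array([joints[name] for name in JOINT_ORDER])  # (15, 3)
    return coords.reshape(5, 3, 3).transpose(2, 0, 1)          # (3, 5, 3)
```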

Summary

Introduction

Personal service robotics applications are already available on the market for use in human-populated environments such as workplaces, public spaces, and homes. Taking inspiration from the work of [6,20], where the authors propose spatiotemporal classification for video description from images and for activity recognition from wearable-device data, respectively, here we aim at achieving the same results by combining CNNs with an LSTM, gaining the benefits of both spatial and temporal learning. Following this idea, we investigate training the recognition module on both the spatial dependencies arising from the relationships among the RGB-D skeleton joints, via a CNN, and the temporal patterns of the activities, via an LSTM. When the performance is compared over the whole duration of a video, the approach performs on par with the other state-of-the-art approaches.
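
A natural way to obtain a per-video prediction from a real-time, windowed classifier is to vote over sliding windows; the sketch below shows this under the assumption of a simple majority vote, with placeholder window length and stride rather than the values tuned in the experiments.

```python
# Hypothetical per-video evaluation: classify sliding windows of frames
# with the CNNLSTM sketched earlier and majority-vote over the windows.
# Window length and stride are placeholders, not the tuned values.
from collections import Counter

import torch

def classify_video(model: torch.nn.Module, frames: torch.Tensor,
                   window: int = 32, stride: int = 8) -> int:
    """Return the majority-vote class for a (T, 3, H, W) frame tensor."""
    model.eval()
    votes = Counter()
    for start in range(0, frames.shape[0] - window + 1, stride):
        clip = frames[start:start + window].unsqueeze(0)  # (1, window, ...)
        with torch.no_grad():
            votes[model(clip).argmax(dim=1).item()] += 1
    return votes.most_common(1)[0][0]
```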

  • Related works
  • The proposed approach
  • Experimental evaluation
      • Dataset
      • Data preprocessing
      • Model settings
      • Classification results (the code is available upon request)
          • LSTM results
          • CNN-LSTM results
          • Statistical hypothesis test
          • Window size results
      • Comparison with the SoA
      • Real setting configuration
  • Findings
  • Conclusions