Abstract
In socially assistive robotics, human activity recognition plays a central role whenever the robot must adapt its behavior to the human's. In this paper, we present an activity recognition approach for activities of daily living based on deep learning and skeleton data. In the literature, ad hoc feature extraction/selection algorithms combined with supervised classification methods have been deployed, achieving excellent classification performance. Here, we propose a deep learning approach that combines a CNN and an LSTM: it learns the spatial dependencies correlating the limbs in a 3D grid representation of the skeleton, learns the temporal dependencies among instances with a periodic pattern, and works on raw data, thus requiring no explicit feature extraction process. The models are designed for real-time activity recognition and are tested on the CAD-60 dataset. Results show that the proposed model outperforms an LSTM-only model thanks to the automatic extraction of features capturing limb correlations. In the “New Person” setting, the CNN-LSTM model achieves 95.4% precision and 94.4% recall, while in the “Have Seen” setting it reaches 96.1% precision and 94.7% recall.
Highlights
Personal service robotics applications are already on the market for use in human-populated environments such as workplaces, public spaces, and homes
We investigate training the recognition module on both the spatial dependencies arising from the relationships among the RGB-D skeleton joints, via a convolutional neural network (CNN), and the temporal patterns of the activities, via a long short-term memory network (LSTM)
Unlike the approaches applied to the CAD-60 dataset that select and extract features manually, we propose a deep learning model for automatic feature extraction, using CNNs to extract spatial dependencies from human poses and LSTMs to extract temporal dependencies between poses
Summary
Personal service robotics applications are already on the market for use in human-populated environments such as workplaces, public spaces, and homes. Taking inspiration from the work of [6,20], where the authors propose spatiotemporal classification for video description from images and for activity recognition from wearable-device data, respectively, we aim at achieving the same results by combining CNNs with an LSTM, gaining the benefits of both spatial and temporal learning. Following this idea, we investigate training the recognition module on both the spatial dependencies arising from the relationships among the RGB-D skeleton joints, via a CNN, and the temporal patterns of the activities, via an LSTM. When the performance is compared over the whole duration of a video, the approach performs on par with other state-of-the-art approaches.
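The combined spatial/temporal design described above can be sketched as a small PyTorch module. This is a minimal illustration, not the paper's exact architecture: the grid size, channel counts, hidden size, and class count are assumptions chosen for clarity. Per-frame 2D convolutions extract spatial dependencies from a skeleton grid whose three channels hold the joints' x, y, z coordinates, and an LSTM then models the temporal dependencies across frames.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Hedged sketch of a CNN-LSTM activity classifier (hyperparameters assumed)."""

    def __init__(self, n_classes=12, hidden=128):
        super().__init__()
        # CNN applied independently to every frame's skeleton grid
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 channels = x, y, z
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),                # fixed-size spatial summary
        )
        # LSTM consumes the per-frame CNN feature vectors in temporal order
        self.lstm = nn.LSTM(32 * 4 * 4, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, 3, H, W) -- a clip of skeleton-grid frames
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))        # (b*t, 32, 4, 4)
        feats = feats.flatten(1).view(b, t, -1)  # (b, t, 512)
        out, _ = self.lstm(feats)                # temporal modelling
        return self.fc(out[:, -1])               # classify from the last time step

model = CNNLSTM()
logits = model(torch.randn(2, 30, 3, 8, 8))  # 2 clips of 30 frames on an 8x8 grid
print(logits.shape)                          # torch.Size([2, 12])
```

Because the convolution weights are shared across frames, the spatial feature extractor stays small and the per-frame cost is constant, which is what makes this kind of pipeline usable for real-time recognition.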