Skeleton-based human activity recognition using ConvLSTM and guided feature learning

Santosh Kumar Yadav, Shaik Ali Akbar, Kamlesh Tiwari, Hari Mohan Pandey

doi:10.1007/s00500-021-06238-7

Abstract

Human activity recognition aims to determine the actions performed by a human in an image or video. Examples of human activity include standing, running, sitting, and sleeping. These activities may involve intricate motion patterns and undesired events such as falling. This paper proposes a novel deep convolutional long short-term memory (ConvLSTM) network for skeleton-based activity recognition and fall detection. The proposed ConvLSTM network is a sequential fusion of convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and fully connected layers. The acquisition system applies human detection and pose estimation to pre-calculate skeleton coordinates from the image or video sequence. The ConvLSTM model uses the raw skeleton coordinates, along with their characteristic geometrical and kinematic features, to construct novel guided features. The geometrical and kinematic features are built upon the raw skeleton coordinates using relative joint positions, differences between joints, spherical joint angles between selected joints, and their angular velocities. The novel spatiotemporal guided features are obtained using a trained multi-layer CNN-LSTM combination. A classification head consisting of fully connected layers is subsequently applied. The proposed model has been evaluated on the KinectHAR dataset, which has 130,000 samples with 81 attribute values and was collected with a Kinect (v2) sensor. Experimental results are compared against the performance of isolated CNNs and LSTM networks. The proposed ConvLSTM achieves an accuracy of 98.89%, which is better than that of the CNNs and LSTMs, at 93.89% and 92.75%, respectively. The proposed system has been tested in real time and is found to be independent of the pose, the facing of the camera, individuals, clothing, etc. The code and dataset will be made publicly available.
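
The geometrical and kinematic features named in the abstract (relative joint positions, spherical joint angles, and angular velocities) can be illustrated with a minimal sketch. The function names, the choice of joint 0 as the reference joint, the spherical-angle convention (polar angle measured from the z axis, azimuth in the x-y plane), and the finite-difference velocity are all assumptions made for illustration; they are not taken from the paper.

```python
import math

def relative_positions(frame, root=0):
    """Express every joint relative to a reference joint.

    `frame` is a list of (x, y, z) tuples; `root` (assumed: index 0)
    is a hypothetical choice of reference joint.
    """
    rx, ry, rz = frame[root]
    return [(x - rx, y - ry, z - rz) for x, y, z in frame]

def spherical_angles(a, b):
    """Polar angle theta and azimuth phi of the vector from joint a to joint b."""
    dx, dy, dz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        # Coincident joints: the angles are undefined, so return zeros.
        return 0.0, 0.0
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    theta = math.acos(max(-1.0, min(1.0, dz / r)))
    phi = math.atan2(dy, dx)
    return theta, phi

def angular_velocity(prev_angle, curr_angle, dt):
    """Finite-difference angular velocity between two consecutive frames.

    The angle difference is wrapped into [-pi, pi) so that a jump across
    the +/-pi boundary is not mistaken for a large rotation.
    """
    diff = (curr_angle - prev_angle + math.pi) % (2 * math.pi) - math.pi
    return diff / dt
```

Under these assumptions, a per-frame feature vector could be formed by concatenating the raw coordinates with the outputs of these helpers for the selected joint pairs; which joints the paper selects is not stated in the abstract.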

Highlights

The basic aim of human activity recognition systems is to automatically recognize the activities of an individual from the raw data obtained from sensors.
We found experimentally that combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks in series yields better efficiency than using either of them individually or combining them in parallel.
Experimental results show that the ConvLSTM model achieves better accuracy (98.89%) than the LSTM (92.75%) and the CNNs (93.89%) used individually.

Summary

Introduction

The basic aim of human activity recognition systems is to automatically recognize the activities of an individual from the raw data obtained from sensors. Activity detection has applications in many areas, such as human-computer interaction, video surveillance, sports analysis, and video understanding (Poppe 2010; Weinland et al 2011; Li et al 2019). Fall detection and early reporting is an important application of human activity recognition. The world's elderly population is expected to increase by 25% by 2050, so it is necessary to assist adults over the age of 65 (DeSA et al 2013). Falls are a major cause of accidents and even death, especially among the elderly. An estimated $31 billion is spent on direct medical costs for fall injuries in the US (Stevens et al 2006), making fall prevention and early reporting necessary.

Methods

Results

Conclusion