A new CNN-LSTM architecture for activity recognition employing wearable motion sensor data: Enabling diverse feature extraction

Enes Koşar, Billur Barshan

doi:10.1016/j.engappai.2023.106529

Abstract

Extracting representative features to recognize human activities through the use of wearables is an area of ongoing research. While hand-crafted features and machine learning (ML) techniques have been sufficiently well investigated in the past, the use of deep learning (DL) techniques is the current trend. Specifically, Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and hybrid models have been investigated. We propose a novel hybrid network architecture to recognize human activities through the use of wearable motion sensors and DL techniques. The LSTM and 2D CNN branches of the model, which run in parallel, receive the raw signals and their spectrograms, respectively. We concatenate the features extracted at each branch and use them for activity recognition. We compare the classification performance of the proposed network with that of six commonly used network architectures, three single and three hybrid: 1D CNN, 2D CNN, LSTM, standard 1D CNN-LSTM, 1D CNN-LSTM proposed by Ordóñez and Roggen, and an alternative 1D CNN-LSTM model. We tune the hyper-parameters of six of the models using Bayesian optimization and test the models on two publicly available datasets. The comparison between the seven networks is based on four performance metrics and complexity measures. Because of the stochastic nature of DL algorithms, we provide the average values and standard deviations of the performance metrics over ten repetitions of each experiment. The proposed 2D CNN-LSTM architecture achieves the highest average accuracies of 95.66% and 92.95% on the two datasets, which are, respectively, 2.45% and 3.18% above those of the 2D CNN model, which ranks second. This improvement is a consequence of the proposed model enabling the extraction of a broader range of complementary features that comprehensively represent human activities.
We evaluate the complexities of the networks in terms of the total number of parameters, model size, training/testing time, and the number of floating point operations (FLOPs). We also compare the results of the proposed network with those of recent related work that uses the same datasets.
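The reporting protocol described above (average and standard deviation of a metric over ten repetitions) can be illustrated with a minimal sketch. It is not the authors' code: `run_experiment` is a hypothetical stand-in that returns a synthetic placeholder accuracy instead of training a network, the base value 90.0 is arbitrary, and the use of the sample standard deviation is an assumption, since the abstract does not state which estimator is used.

```python
import random
import statistics


def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for one train/test run.

    Returns a synthetic accuracy in percent (NOT a result from the paper);
    a real run would train and evaluate a network with this seed.
    """
    rng = random.Random(seed)
    return 90.0 + rng.uniform(-1.0, 1.0)


def summarize(values):
    """Return (mean, sample standard deviation) of a list of metric values."""
    # Assumption: sample (n - 1) standard deviation; the source does not specify.
    return statistics.mean(values), statistics.stdev(values)


# Ten repetitions, each with a different random seed.
accuracies = [run_experiment(seed) for seed in range(10)]
mean_acc, std_acc = summarize(accuracies)
print(f"accuracy: {mean_acc:.2f} ± {std_acc:.2f} % over {len(accuracies)} runs")
```

Reporting the spread alongside the mean makes it possible to judge whether a gap between two models exceeds the run-to-run variation caused by random initialization and data shuffling.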

Full Text