Smartphone-based Human Activity Recognition (HAR) identifies human movements using inertial signals gathered from multiple smartphone sensors. Typically, these signals are stacked into a single input (data-level fusion) and fed into deep learning algorithms for feature extraction. This research instead studies feature-level fusion and proposes a lightweight deep temporal learning model, the Feature-Level Fusion Multi-Sensor Aggregation Temporal Network (FLF-MSATN), which performs feature extraction on the inertial signals of each sensor separately. The raw signals, segmented into equally sized time windows, are passed into individual Dilated-Pooled Convolutional Heads (DPC Heads) for temporal feature analysis. Each DPC Head contains a spatiotemporal block of dilated causal convolutions and average pooling that extracts the underlying patterns. The DPC Heads' outputs are concatenated and passed into a Global Average Pooling layer to generate a condensed confidence map before activity classification. FLF-MSATN is assessed under a subject-independent protocol on a publicly available HAR dataset, UCI HAR, and on a self-collected HAR dataset, achieving accuracies of 96.67% and 82.70%, respectively. A Data-Level Fusion MSATN is built as a baseline to verify the performance of the proposed FLF-MSATN. The empirical results show that FLF-MSATN improves accuracy by ~3.4% on UCI HAR and ~9.68% on the self-collected dataset.
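As an illustration of the pipeline described above, here is a minimal PyTorch sketch of the feature-level fusion idea: one DPC Head (dilated causal convolutions plus average pooling) per sensor, channel-wise concatenation of the head outputs, Global Average Pooling over time, and a linear classification layer. All layer sizes, kernel widths, dilation rates, and the two tri-axial example sensors are assumptions for illustration; the abstract does not specify the paper's hyperparameters.

```python
# Hypothetical sketch of the FLF-MSATN idea; all hyperparameters are assumed.
import torch
import torch.nn as nn


class DPCHead(nn.Module):
    """One Dilated-Pooled Convolutional Head: a spatiotemporal block of
    dilated causal convolutions followed by average pooling, applied to
    the signals of a single sensor."""

    def __init__(self, in_channels: int, out_channels: int = 32,
                 kernel_size: int = 3, dilations=(1, 2, 4)):
        super().__init__()
        layers = []
        ch = in_channels
        for d in dilations:
            # Left-pad so the convolution stays causal (no future leakage).
            layers += [
                nn.ConstantPad1d(((kernel_size - 1) * d, 0), 0.0),
                nn.Conv1d(ch, out_channels, kernel_size, dilation=d),
                nn.ReLU(),
            ]
            ch = out_channels
        layers.append(nn.AvgPool1d(kernel_size=2))  # temporal down-sampling
        self.block = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, in_channels, time)
        return self.block(x)       # -> (batch, out_channels, time // 2)


class FLFMSATN(nn.Module):
    """Feature-level fusion: one DPC Head per sensor, outputs concatenated,
    condensed by Global Average Pooling, then classified."""

    def __init__(self, sensor_channels=(3, 3), num_classes: int = 6):
        super().__init__()
        self.heads = nn.ModuleList(DPCHead(c) for c in sensor_channels)
        self.classifier = nn.Linear(32 * len(sensor_channels), num_classes)

    def forward(self, sensor_signals):     # list of (batch, channels, time)
        feats = [head(x) for head, x in zip(self.heads, sensor_signals)]
        fused = torch.cat(feats, dim=1)    # concatenate along the channel axis
        pooled = fused.mean(dim=-1)        # Global Average Pooling over time
        return self.classifier(pooled)     # activity logits


# Example: two tri-axial sensors (accelerometer and gyroscope),
# segmented into 128-sample windows, six activity classes as in UCI HAR.
model = FLFMSATN(sensor_channels=(3, 3), num_classes=6)
acc = torch.randn(8, 3, 128)
gyro = torch.randn(8, 3, 128)
logits = model([acc, gyro])  # -> (8, 6)
```

Note that a data-level fusion baseline would instead stack all sensor channels into one tensor and feed it through a single convolutional stack; the per-sensor heads are what distinguish the feature-level design.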