Artificial intelligence has advanced sensor-based human motion capture and recognition technology in various engineering fields, such as human–robot collaboration and health monitoring. Deep learning methods can achieve satisfactory recognition results when provided with sufficient labeled data; however, labeling a large dataset is expensive and time-consuming in practical applications. To address this issue, this paper proposes a deep convolutional transformer-based contrastive self-supervised (DCTCSS) model built on the bootstrap your own latent (BYOL) framework, which aims to achieve reliable activity recognition using only a small amount of labeled data. First, a deep convolutional transformer (DCT) model is proposed as the backbone of the DCTCSS model to learn high-level feature representations from unlabeled data during pre-training. A simple linear classifier is then trained by supervised fine-tuning on a limited amount of labeled data to recognize activities. In addition, a random data augmentation strategy is formulated experimentally to increase the diversity of the input data. The performance of the DCTCSS model is evaluated against several state-of-the-art algorithms on three widely used datasets covering daily life, medical monitoring, and intelligent manufacturing. Experimental results show that the DCTCSS model achieves mean F1 scores of 95.64%, 88.39%, and 98.40% on the UCI-HAR, Skoda, and MHEALTH datasets, respectively, using only 10% of the labeled data, outperforming both supervised and unsupervised baseline models. Consequently, the DCTCSS model reduces the dependence on large amounts of labeled data while still achieving competitive activity recognition performance.
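To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch of one BYOL-style pre-training step on unlabeled sensor windows, followed by supervised fine-tuning of a linear classifier on a small labeled subset. The `DCTBackbone`, the `augment` function, and all dimensions and hyperparameters are illustrative assumptions, not the paper's actual DCT architecture, augmentation strategy, or settings.

```python
# Hedged sketch of BYOL-style pre-training + linear fine-tuning for
# sensor-based activity recognition. All architectural details below
# (layer sizes, augmentations, EMA rate) are assumptions for illustration.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCTBackbone(nn.Module):
    """Hypothetical deep convolutional transformer encoder for sensor windows."""
    def __init__(self, in_channels: int, dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                  # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)   # -> (batch, time, dim)
        h = self.transformer(h)
        return h.mean(dim=1)               # temporal average pooling -> (batch, dim)

def mlp(dim, hidden=256, out=128):
    """Projector/predictor head, as in standard BYOL."""
    return nn.Sequential(nn.Linear(dim, hidden), nn.BatchNorm1d(hidden),
                         nn.ReLU(), nn.Linear(hidden, out))

def augment(x):
    """Placeholder random augmentation: Gaussian jitter + channel scaling."""
    noise = 0.05 * torch.randn_like(x)
    scale = 1.0 + 0.1 * torch.randn(x.size(0), x.size(1), 1)
    return scale * x + noise

def byol_loss(p, z):
    """Negative cosine similarity; the target branch receives no gradient."""
    return 2 - 2 * F.cosine_similarity(p, z.detach(), dim=-1).mean()

# --- self-supervised pre-training on an unlabeled mini-batch --------------
online, proj, pred = DCTBackbone(in_channels=9), mlp(128), mlp(128)
target, target_proj = copy.deepcopy(online), copy.deepcopy(proj)
for p in list(target.parameters()) + list(target_proj.parameters()):
    p.requires_grad = False               # target network is EMA-updated only
opt = torch.optim.Adam(list(online.parameters()) + list(proj.parameters())
                       + list(pred.parameters()), lr=3e-4)

x = torch.randn(32, 9, 128)               # placeholder unlabeled windows
v1, v2 = augment(x), augment(x)           # two random views of each window
p1, p2 = pred(proj(online(v1))), pred(proj(online(v2)))
z1, z2 = target_proj(target(v1)), target_proj(target(v2))
loss = 0.5 * (byol_loss(p1, z2) + byol_loss(p2, z1))  # symmetrized loss
opt.zero_grad(); loss.backward(); opt.step()
for pt, po in zip(list(target.parameters()) + list(target_proj.parameters()),
                  list(online.parameters()) + list(proj.parameters())):
    pt.data.mul_(0.99).add_(0.01 * po.data)            # EMA target update

# --- supervised fine-tuning of a linear classifier on few labels ----------
classifier = nn.Linear(128, 6)             # e.g. six UCI-HAR activity classes
y = torch.randint(0, 6, (32,))             # placeholder labels (small subset)
with torch.no_grad():
    feats = online(x)                      # frozen pre-trained encoder
clf_loss = F.cross_entropy(classifier(feats), y)
```

In this setup only the small linear head is trained on labeled data, which is what lets the approach work with roughly 10% of the labels: the encoder's representations come entirely from the self-supervised stage.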