In 3D human pose estimation for in-bed health monitoring in real-world scenarios, challenging problems arise from varying illumination conditions and depth ambiguity. To address these issues, we follow the 2D-to-3D pipeline and develop a novel 3D skeleton tracking approach based on multi-source image fusion. Thermal images are used to estimate the 2D pose, as they are unaffected by changes in illumination and remain usable even in complete darkness. Depth images are then employed to infer the pose depth, benefiting from their excellent depth resolution. In the 2D pose estimation phase, to exploit both spatial and temporal information, we propose a 2D pose estimation module based on a Graph Convolutional Network and a Transformer network. To train this module, we create a dataset using an independently developed automatic annotation method. In the subsequent 3D pose estimation phase, typical 2D-to-3D lifting approaches infer a 3D human pose from an intermediate 2D pose estimated by a deep network, but the task is ill-posed because thermal images lack depth resolution. We overcome this challenge by inferring the pose depth through fusion with depth images, which offer superior depth resolution but are limited by illumination and texture conditions. We found that (i) the developed automatic annotation method yields accurate pose annotations for training 2D pose estimation on infrared thermal images; (ii) on the proposed dataset, our 2D pose estimation framework achieved a mean error below 4 pixels and outperformed state-of-the-art methods; (iii) our approach achieved precise 3D pose estimation via fusion with depth images; and (iv) future research should consider applying the outlined approach to support the development of novel human pose models for health monitoring.
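The abstract describes recovering pose depth by fusing 2D joint estimates with an aligned depth image. As a minimal sketch of one standard realization of this step, the snippet below back-projects 2D joint pixels through a depth map using the pinhole camera model; the function name, intrinsics, and sampling scheme are illustrative assumptions, not the paper's actual fusion method.

```python
import numpy as np

def lift_2d_to_3d(joints_2d, depth_map, fx, fy, cx, cy):
    """Back-project 2D joint pixels into 3D camera coordinates.

    joints_2d : iterable of (u, v) pixel coordinates (depth-aligned frame)
    depth_map : 2D array of depth values (metres) aligned to the 2D pose
    fx, fy    : focal lengths in pixels; cx, cy : principal point
    NOTE: a hypothetical helper; real pipelines also handle invalid
    depth readings and sensor-to-sensor extrinsic alignment.
    """
    joints_3d = []
    for u, v in joints_2d:
        z = depth_map[int(round(v)), int(round(u))]  # depth at the joint pixel
        x = (u - cx) * z / fx                        # pinhole back-projection
        y = (v - cy) * z / fy
        joints_3d.append((x, y, z))
    return np.array(joints_3d)

# toy example: a flat scene 2 m from the camera, principal point at image centre
depth = np.full((480, 640), 2.0)
pts = lift_2d_to_3d([(320.0, 240.0), (400.0, 240.0)],
                    depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# the first joint sits on the optical axis; the second is offset in x
```

In practice the depth sample at each joint is often taken as a median over a small window to suppress sensor noise, and joints with missing depth are interpolated from neighbours.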