General movement and pose assessment of infants is crucial for the early detection of cerebral palsy (CP). Nevertheless, most human pose estimation methods, whether in 2D or 3D, focus on adults because large-scale datasets and pose annotations for infants are lacking. To address these problems, we present YOLO-infantPose, a fine-tuned model for 2D infant pose estimation. We further propose STAPose3D, a self-supervised model for video-based 3D infant pose estimation. To cope with the absence of 3D pose annotations, we employ multi-view video data during training. STAPose3D combines temporal convolution, temporal attention, and graph attention to jointly learn spatio-temporal features of infant pose. Our method comprises two stages: applying YOLO-infantPose to input videos, then lifting the resulting 2D poses, together with the per-joint confidences, to 3D. Using the best-performing 2D detector in the first stage significantly improves the precision of 3D pose estimation. We show that the fine-tuned YOLO-infantPose outperforms other models on our clinical dataset as well as on two public datasets, MINI-RGBD and YouTube-Infant. Results from our infant movement video dataset demonstrate that STAPose3D effectively captures spatio-temporal features across different views and significantly improves the performance of 3D infant pose estimation in videos. Finally, we explore the clinical application of our method for general movement assessment (GMA) on a clinical dataset annotated as normal writhing movements or abnormal monotonic movements according to the GMA standards. We show that the 3D pose estimation results produced by our STAPose3D model significantly improve GMA prediction performance compared with 2D pose estimation. Our code is available at github.com/wwYinYin/STAPose3D.
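To make the two-stage idea concrete, the sketch below illustrates (in a minimal, hypothetical form that is not the authors' released code) how 2D joint detections with per-joint confidences can be lifted to 3D by a network mixing temporal convolution, joint-wise attention, and temporal attention. All names here (e.g. SimpleLifter, n_joints) are assumptions for illustration only.

```python
# Minimal sketch of a 2D-to-3D pose lifter; hypothetical, not the paper's implementation.
import torch
import torch.nn as nn


class SimpleLifter(nn.Module):
    """Lift 2D poses (x, y, confidence) over a clip of T frames to 3D poses."""

    def __init__(self, n_joints=17, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(3, d_model)                 # (x, y, conf) -> feature
        self.temporal_conv = nn.Conv1d(d_model, d_model,   # local temporal context
                                       kernel_size=3, padding=1)
        self.joint_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 3)                  # per-joint 3D coordinates

    def forward(self, pose2d):
        # pose2d: (batch, T, n_joints, 3), last channel is the detector confidence
        b, t, j, _ = pose2d.shape
        x = self.embed(pose2d)                                    # (b, t, j, d)

        # Temporal convolution, applied independently per joint.
        xt = x.permute(0, 2, 3, 1).reshape(b * j, -1, t)          # (b*j, d, t)
        xt = torch.relu(self.temporal_conv(xt))
        x = xt.reshape(b, j, -1, t).permute(0, 3, 1, 2)           # (b, t, j, d)

        # Attention across joints within each frame (a stand-in for graph attention).
        xs = x.reshape(b * t, j, -1)
        xs, _ = self.joint_attn(xs, xs, xs)
        x = xs.reshape(b, t, j, -1)

        # Attention across frames for each joint (temporal attention).
        xf = x.permute(0, 2, 1, 3).reshape(b * j, t, -1)
        xf, _ = self.temporal_attn(xf, xf, xf)
        x = xf.reshape(b, j, t, -1).permute(0, 2, 1, 3)

        return self.head(x)                                       # (b, t, j, 3)


if __name__ == "__main__":
    # Example: a batch of 2D detections for a 27-frame clip with 17 joints.
    detections_2d = torch.rand(2, 27, 17, 3)
    poses_3d = SimpleLifter()(detections_2d)
    print(poses_3d.shape)  # torch.Size([2, 27, 17, 3])
```

In practice, the 2D detections fed to such a lifter would come from the fine-tuned 2D detector (YOLO-infantPose in the paper), and the multi-view self-supervision described in the abstract would supply the training signal in place of 3D annotations.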