AbstractHead pose estimation is an especially challenging task due to the complexity nonlinear mapping from 2D feature space to 3D pose space. To address the above issue, this paper presents a novel and efficient head pose estimation framework based on particle swarm optimized contrastive learning and multimodal entangled graph convolution network. Firstly, a new network, the region and difference‐aware feature pyramid network (RD‐FPN), is proposed for 2D keypoints detection to alleviate the background interference and enhance the feature expressiveness. Then, particle swarm optimized contrastive learning is constructed to alternatively match 2D and 3D keypoints, which takes the multimodal keypoints matching accuracy as the optimization objective, while considering the similarity of cross‐modal positive and negative sample pairs from contrastive learning as a local contrastive constraint. Finally, multimodal entangled graph convolution network is designed to enhance the ability of establishing geometric relationships between keypoints and head pose angles based on second‐order bilinear attention, in which point‐edge attention is introduced to improve the representation of geometric features between multimodal keypoints. Compared with other methods, the average error of our method is reduced by 8.23%, indicating the accuracy, generalization, and efficiency of our method on the 300W‐LP, AFLW2000, BIWI datasets.
Read full abstract