Abstract

Most graph convolutional network (GCN)-based 3D human pose estimation (HPE) methods address single-view 3D HPE with fixed spatial graphs, and therefore suffer from key problems such as depth ambiguity, insufficient feature representation, and limited receptive fields. To address these issues, we propose a multi-view 3D HPE framework based on a deep semantic graph transformer, which adaptively learns and fuses significant multi-view semantic features of human joints to improve 3D HPE performance. First, we propose a deep semantic graph transformer encoder to enrich spatial feature information: it deeply mines the position, spatial structure, and skeletal edge knowledge of joints and dynamically learns their correlations. Then, we build a progressive multi-view spatial-temporal feature fusion framework to mitigate joint depth uncertainty. To enhance the spatial pose representation, deep spatial semantic features are exchanged and fused across viewpoints during monocular feature extraction. Furthermore, long-range temporal dependencies are modeled, and spatial-temporal information from all viewpoints is fused to provide intermediate supervision of joint depth. Extensive experiments on three 3D HPE benchmarks show that our method achieves state-of-the-art results. It effectively enhances pose features, mitigates the depth ambiguity of single-view 3D HPE, and improves 3D HPE performance without requiring camera parameters. Code and models are available at https://github.com/z0911k/SGraFormer.
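
The abstract describes the encoder only at a high level. The following minimal sketch (not the authors' released code; the module names, feature dimension, and the 17-joint skeleton below are assumptions) illustrates one way a "semantic graph transformer" layer could combine joint-position embeddings, skeletal edge (bone) features, and a learned skeleton-structure attention bias over the joints of a single view.

```python
# Illustrative sketch only: a single semantic graph transformer encoder layer.
# Assumptions: 17-joint Human3.6M-style skeleton, 2D keypoint inputs, dim=128.
import torch
import torch.nn as nn

NUM_JOINTS = 17
# Assumed kinematic tree (parent index per joint; -1 marks the root).
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

class SemanticGraphTransformerLayer(nn.Module):
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.joint_embed = nn.Linear(2, dim)   # 2D joint positions -> joint tokens
        self.edge_embed = nn.Linear(2, dim)    # bone (skeletal edge) vectors -> edge features
        # Learned bias added to every attention logit, encoding spatial structure.
        self.struct_bias = nn.Parameter(torch.zeros(NUM_JOINTS, NUM_JOINTS))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, joints_2d):
        # joints_2d: (B, NUM_JOINTS, 2) detected 2D keypoints for one view.
        parents = [max(p, 0) for p in PARENTS]          # root gets a zero-length bone
        bones = joints_2d - joints_2d[:, parents]       # skeletal edge vectors
        x = self.joint_embed(joints_2d) + self.edge_embed(bones)
        # Attention over joints, dynamically learning their correlations,
        # with the structure bias injected as an additive attention mask.
        attn_out, _ = self.attn(x, x, x, attn_mask=self.struct_bias, need_weights=False)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.mlp(x))
        return x                                        # (B, NUM_JOINTS, dim) per-joint features

# Example usage on a random batch of 4 poses.
layer = SemanticGraphTransformerLayer()
feats = layer(torch.randn(4, NUM_JOINTS, 2))
print(feats.shape)  # torch.Size([4, 17, 128])
```

In the full framework described above, per-view features of this kind would then be fused across viewpoints and over time; that fusion stage is not shown here.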
