Abstract

Learning human 2D-3D correspondences, also known as human densepose estimation, aims to map every human pixel in a 2D image to a 3D human template. The task involves surface patch recognition (i.e., Index-to-Patch (I)) and regression of patch-specific UV coordinates. Despite recent progress, it remains challenging “in the wild”, where RGB images capture real-world scenes with cluttered backgrounds, occlusions, scale variations, and diverse postures. In this paper, we address three vital problems in this task: 1) how to perceive multi-scale visual information for instances “in the wild”; 2) how to design learning objectives that yield a precise instance representation when the “multiple instances in one bounding box” phenomenon occurs; and 3) how to boost the performance of index-to-patch prediction under limited supervision. To tackle these problems, we propose an end-to-end deep Adaptive Multi-path Aggregation network (AMA-net) for human densepose estimation. First, we introduce an adaptive multi-path aggregation algorithm that extracts instance-level features of varying sizes, which capture the multi-scale information of a bounding box and are then used for parsing different instances. Second, we adopt an instance augmentation learning objective to further distinguish the target instance from other, interfering instances. Third, taking advantage of 2D human parsers that are trained with sufficient annotations, we introduce a task transformer that bridges the “gap” between 2D human parsing and densepose estimation, thus improving the performance of the densepose estimator. Experimental results on the challenging DensePose-COCO dataset demonstrate that our approach significantly outperforms state-of-the-art methods and sets a new record. Code and models are publicly available.
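To make the output representation concrete, the following is a minimal sketch of how a per-pixel patch index (I) and patch-specific UV coordinates could be combined into one correspondence map. It is an illustration of the general IUV idea only, not the AMA-net implementation: the function name `decode_iuv`, the nested-list layout of the inputs, and the convention that index 0 means background are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical layout: maps[k][y][x], where k indexes the surface patch.
Map3D = List[List[List[float]]]


def decode_iuv(patch_scores: Map3D, u_maps: Map3D, v_maps: Map3D
               ) -> List[List[Tuple[int, float, float]]]:
    """Combine patch classification and per-patch UV regression.

    patch_scores[k][y][x] is the score of patch k at pixel (y, x);
    k == 0 is assumed to be background. u_maps and v_maps hold one
    regression map per patch (index 0 is unused).
    Returns, for each pixel, a tuple (I, U, V).
    """
    num_classes = len(patch_scores)
    height = len(patch_scores[0])
    width = len(patch_scores[0][0])

    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Surface patch recognition: pick the highest-scoring patch.
            best = max(range(num_classes),
                       key=lambda k: patch_scores[k][y][x])
            if best == 0:
                # Background pixels carry no surface coordinates.
                row.append((0, 0.0, 0.0))
            else:
                # Read U and V only from the maps of the chosen patch.
                row.append((best, u_maps[best][y][x], v_maps[best][y][x]))
        result.append(row)
    return result


# Toy input: a 1x2 image with background plus two patches.
scores = [[[0.9, 0.1]], [[0.05, 0.2]], [[0.05, 0.7]]]
u = [[[0.0, 0.0]], [[0.3, 0.4]], [[0.5, 0.6]]]
v = [[[0.0, 0.0]], [[0.1, 0.2]], [[0.7, 0.8]]]
iuv = decode_iuv(scores, u, v)
```

In this toy input the first pixel is classified as background and the second as patch 2, so only the second pixel receives UV values, taken from the maps of patch 2.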
