Although multi-view human pose and shape regression methods can draw on information from other views to complement and correct their predictions, existing methods do not take full advantage of the multi-view setup and are therefore far from efficiently aligning and merging features across views. To tackle these problems, we propose a multi-view framework in which features from all views are aligned and merged through multi-view voxel aggregation with inverse projection. Our framework has three major characteristics. First, a multi-view volumetric aggregation module improves prediction by exploiting the information in feature maps at different scales. Second, instead of using all voxels, a mesh-aligned voxel selection module makes prediction more effective by eliminating redundant background voxels. Third, the framework further improves human body parametric modeling by adopting a dual-branch strategy, in which one branch predicts the parametric human model and the other predicts 3D keypoints; the mutual influence between the two branches is critical to improving both tasks. Additionally, we find that the scarcity of datasets also hinders the development of multi-view methods, so we propose an approach for creating occlusion datasets specifically for the multi-view occlusion case. Experimental results on two benchmarks, Human3.6M and MPI-INF-3DHP, verify the effectiveness of the proposed framework.