Abstract

The objective of this work is to reconstruct the 3D surfaces of sculptures from one or more images using a view-dependent representation. To this end, we train a network, SiDeNet, to predict the Silhouette and Depth of the surface given a variable number of images; the silhouette is predicted at a different viewpoint from the inputs (e.g. from the side), while the depth is predicted at the viewpoint of the input images. This has three benefits. First, the network learns a representation of shape beyond that of a single viewpoint, as the silhouette forces it to respect the visual hull, and the depth image forces it to predict concavities (which do not appear on the visual hull). Second, as the network learns about 3D using the proxy tasks of predicting depth and silhouette images, it is not limited by the resolution of the 3D representation. Finally, using a view-dependent representation (i.e. additionally encoding the viewpoint with the input image) improves the network's generalisability to unseen objects. Additionally, the network handles the input views flexibly. First, it can ingest a different number of views during training and testing, and reconstruction performance is shown to improve as additional views are added at test time. Second, the additional views need not be photometrically consistent. The network is trained and evaluated on two synthetic datasets: a realistic sculpture dataset (SketchFab) and ShapeNet. The design of the network is validated by comparing to state-of-the-art methods on a set of tasks. It is shown that (i) passing the input viewpoint (i.e. using a view-dependent representation) improves the network's generalisability at test time; (ii) predicting depth/silhouette images allows for higher-quality predictions in 2D, as the network is not limited by the chosen latent 3D representation; and (iii) on both datasets the method of combining views in a global manner performs better than a local method.
Finally, we show that the trained network generalizes to real images, and probe how the network has encoded the latent 3D shape.
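The two design choices highlighted above, encoding the viewpoint alongside each input image and combining a variable number of views in a global manner, can be illustrated with a toy sketch. This is a minimal illustration only, not SiDeNet's actual architecture: the names (`encode_view`, `combine_views`, `W_enc`), the linear encoder, and the sin/cos viewpoint encoding are assumptions for the example; the real network uses convolutional encoders and decoders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy encoder weights (hypothetical): 1024-d image features + 2-d viewpoint -> 128-d code.
W_enc = rng.standard_normal((1026, 128)) * 0.01

def encode_view(image_feat, viewpoint_angle):
    """Encode one view: concatenate the image features with an encoding of the
    viewpoint (the view-dependent representation), then apply a toy linear+ReLU encoder."""
    v = np.array([np.sin(viewpoint_angle), np.cos(viewpoint_angle)])
    x = np.concatenate([image_feat, v])           # (1026,)
    return np.maximum(x @ W_enc, 0.0)             # (128,)

def combine_views(view_codes):
    """Global combination: an element-wise max over the per-view codes, so the
    same network accepts any number of views at train and test time."""
    return np.max(np.stack(view_codes), axis=0)   # (128,)

# Three views at different (known) viewpoints; the list length is arbitrary.
views = [(rng.standard_normal(1024), a) for a in (0.0, np.pi / 3, np.pi / 2)]
code = combine_views([encode_view(f, a) for f, a in views])
```

Because the pooling is symmetric and order-independent, adding a fourth view at test time changes only the list length, not the network, which mirrors how additional views improve reconstruction without retraining.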

Highlights

  • Learning to infer the 3D shape of complex objects given only a few images is one of the grand challenges of computer vision

  • The evaluation measures used are intersection over union (IoU) for the silhouette, L1 error for the depth, and Chamfer distance when evaluating in 3D

  • Training with silhouettes + depth is compared against variants supervised with depth only, silhouettes only, and full 3D. This comparison is only done on ShapeNet, as for the Sculpture dataset we found it was necessary to subtract off the mean depth to predict high-quality depth maps (Sect. 4.2)
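The three evaluation measures named above are standard, and can be sketched as follows. This is a generic sketch under the usual definitions, not the paper's evaluation code; the function names and the foreground-mask convention for the L1 depth error are assumptions.

```python
import numpy as np

def silhouette_iou(pred, gt):
    """Intersection over union between two binary silhouette masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    inter = np.logical_and(pred, gt).sum()
    return inter / union if union else 1.0

def depth_l1(pred, gt, mask):
    """Mean L1 depth error over foreground pixels selected by a boolean mask."""
    return np.abs(pred[mask] - gt[mask]).mean()

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbour distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

For example, two identical silhouettes give an IoU of 1.0, and the Chamfer distance between a point set and itself is 0.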



Introduction

Learning to infer the 3D shape of complex objects given only a few images is one of the grand challenges of computer vision. Learning-based approaches build on the classic work of Blanz and Vetter (1999) for faces, later extended to other classes such as semantic categories (Kar et al. 2015; Cashman and Fitzgibbon 2013) or cuboidal room structures (Fouhey 2015; Hedau et al. 2009). This work extends this area in two directions: first, it considers 3D shape inference from multiple images rather than a single one (though the single-image case is considered as well); second, it considers the quite generic class of piecewise-smooth textured sculptures and the associated challenges. The views need not be photometrically consistent.

International Journal of Computer Vision (2019) 127:1780–1800

