Abstract

3D object recognition is a fundamental task in 3D computer vision. View-based methods have received considerable attention due to their high efficiency and superior performance. To better capture the long-range dependencies among multi-view images, Transformers have recently been introduced into view-based 3D object recognition and have achieved excellent performance. However, existing Transformer-based methods do not sufficiently utilize the information shared among views at multiple scales. To address this limitation, we propose a 3D object recognition method named iMVS that integrates Multi-View information across multiple Scales. Specifically, for the single-view image/features at each scale, we adopt a hybrid feature extraction module consisting of a CNN and a Transformer to jointly capture local and non-local information. For the extracted multi-view features at each scale, we develop a feature transfer module including a view Transformer block that transfers information across views. Through a sequential process of single-view feature extraction and multi-view feature transfer at multiple scales, multi-view information is sufficiently exchanged. The multi-scale features with multi-view information are then fed into our feature aggregation module to generate a category-specific descriptor, where the adopted channel Transformer block makes the descriptor more expressive. With these designs, our method fully exploits the information embedded in multi-view images. Experimental results on ModelNet40, ModelNet10, and the real-world dataset MVP-N demonstrate the superior performance of our method.
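To make the cross-view information transfer concrete, the sketch below implements single-head self-attention in which each view's feature vector acts as one token, so every view can attend to every other view. This is only a minimal illustration of the general idea behind a view Transformer block, not the paper's implementation; all names, weight shapes, and toy dimensions here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def view_attention(feats, Wq, Wk, Wv):
    """Single-head self-attention over views.

    feats: (V, d) array, one d-dim feature vector per view.
    Each view's output is a weighted mix of all views, which is
    how information is transferred across views (illustrative only).
    """
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (V, V) cross-view weights
    return attn @ V                        # each row mixes all views

# Toy example: 6 views of one object, 16-dim features per view.
rng = np.random.default_rng(0)
num_views, dim = 6, 16
feats = rng.standard_normal((num_views, dim))
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
out = view_attention(feats, Wq, Wk, Wv)
print(out.shape)  # (6, 16): same shape as the input, views now mixed
```

In the method described above, a block like this would be applied to the multi-view features at each scale, interleaved with per-view CNN/Transformer feature extraction.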
