Abstract
Text, 2D images, and 3D shapes are crucial representations of information in modern science and management disciplines. However, the complexity and irregularity of 3D data make it scarce and expensive to generate, which limits its processing and application. In this paper, we present MeshCLIP, a new cross-modal information learning paradigm that directly processes 3D mesh data end-to-end in a zero/few-shot manner. Specifically, we design a novel pipeline based on visual factors and graphics principles that bridges the gap between 3D mesh data and other modalities, thereby joining 2D/3D visual and textual information for zero/few-shot learning. We then construct a self-attention adapter that learns the key information of 3D meshes from only a few training priors, significantly improving the model's discriminative ability. Extensive experiments demonstrate that the proposed MeshCLIP achieves state-of-the-art results on multiple challenging 3D mesh datasets. Across the 3D domain, the proposed zero-shot approach significantly outperforms other existing 3D representation methods, with accuracy 3× higher (an increase of 41.5%) on the ModelNet40 dataset. Furthermore, in few-shot learning, MeshCLIP uses only a few supervised priors (less than 10% of the sample size) to achieve results close to those of methods trained on the full dataset.
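To make the cross-modal idea concrete, the following is a minimal sketch (not the authors' released implementation) of how a zero-shot mesh classifier in this spirit can be assembled: a mesh is rasterized into several 2D views by any off-the-shelf renderer, the views and natural-language class prompts are embedded with a pretrained CLIP model, and the view-wise similarities are aggregated into a class prediction. The OpenAI `clip` package, the `ViT-B/32` backbone, and the prompt template are assumptions chosen for illustration only.

```python
# Illustrative sketch of CLIP-based zero-shot mesh classification.
# Assumptions: views of the mesh have already been rendered to PIL images by
# any mesh renderer; the OpenAI CLIP package (https://github.com/openai/CLIP)
# and the ViT-B/32 backbone are used. This is NOT MeshCLIP's actual pipeline.

import torch
import clip


def zero_shot_classify(view_images, class_names, device="cuda"):
    """view_images: list of PIL images rendered from one 3D mesh."""
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Bridge 3D -> 2D: stack the rendered views into a single image batch.
    images = torch.stack([preprocess(v) for v in view_images]).to(device)

    # Encode views and natural-language class prompts in CLIP's joint space.
    prompts = clip.tokenize(
        [f"a 3D model of a {c}" for c in class_names]
    ).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(images)   # (n_views, d)
        txt_feat = model.encode_text(prompts)   # (n_classes, d)

    # Cosine similarity between each view and each class prompt.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    logits = img_feat @ txt_feat.T              # (n_views, n_classes)

    # Aggregate over views and pick the best-matching class.
    return logits.mean(dim=0).argmax().item()
```

In a few-shot variant, a small trainable adapter (e.g., a self-attention layer over the per-view features, as the abstract describes) would be inserted before the similarity step and fine-tuned on the limited labeled samples, while the CLIP encoders stay frozen.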