Abstract

Existing view-based 3D object classification and recognition methods ignore the inherent hierarchical correlation and distinguishability of views, which makes it difficult to further improve classification accuracy. To solve this problem, this paper proposes an end-to-end multi-view dual attention network framework for high-precision recognition of 3D objects. On the one hand, we obtain three feature layers (query, key, and value) through the convolution layer. The spatial attention matrix is generated from the query and key, and it assigns a different importance to each feature in the value branch of the original feature space. This clearly captures the prominent detail features in the view, generates the view space shape descriptor, and focuses on the detailed parts of the view that discriminate between categories. On the other hand, a channel attention vector is obtained by compressing the channel information in different views, and the attention weight of each view feature is scaled to find the correlation between the target views and to focus on the views with important features. The two feature descriptors are integrated to generate a global shape descriptor of the 3D model, which responds more strongly to the distinguishing features of the object model and can be used for high-precision 3D object recognition. Using ResNet50 as the base CNN model, the proposed method achieves an overall accuracy of 96.6% and an average accuracy of 95.5% on the open-source ModelNet40 dataset compiled by Princeton University. Compared with existing deep learning methods, the experimental results demonstrate that the proposed method achieves state-of-the-art 3D object classification accuracy.
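The two attention branches described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the 1x1 convolutions act as plain matrix multiplications over channels, that the spatial attention matrix is a row-wise softmax of query-key products, and that the view branch uses average-pool compression followed by two small fully connected layers with a sigmoid. The function names, weight shapes, and random inputs are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat, w_q, w_k, w_v):
    """Assumed form of the view space branch for one view.

    feat: (C, H, W) feature map; w_q, w_k: (Cq, C); w_v: (C, C).
    Each weight matrix stands in for a 1x1 convolution.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)        # (C, N) with N = H * W positions
    q = w_q @ x                       # query, (Cq, N)
    k = w_k @ x                       # key,   (Cq, N)
    v = w_v @ x                       # value, (C, N)
    attn = softmax(q.T @ k, axis=-1)  # (N, N), each row sums to 1
    out = v @ attn.T                  # position i = sum_j attn[i, j] * v[:, j]
    return out.reshape(c, h, w), attn

def view_channel_attention(view_feats, w1, w2):
    """Assumed form of the view channel branch.

    view_feats: (V, D), one descriptor per view; w1: (R, V); w2: (V, R).
    """
    squeezed = view_feats.mean(axis=1)              # compress each view to a scalar
    hidden = np.maximum(w1 @ squeezed, 0.0)         # ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid, one weight per view
    return view_feats * weights[:, None], weights

# Hypothetical sizes: 8 channels, 4x4 feature maps, 12 views.
rng = np.random.default_rng(0)
C, H, W, V = 8, 4, 4, 12
feat = rng.standard_normal((C, H, W))
w_q = rng.standard_normal((C // 2, C))
w_k = rng.standard_normal((C // 2, C))
w_v = rng.standard_normal((C, C))
spatial_out, attn = spatial_attention(feat, w_q, w_k, w_v)

view_feats = rng.standard_normal((V, C))
w1 = rng.standard_normal((V // 4, V))
w2 = rng.standard_normal((V, V // 4))
scaled_views, view_weights = view_channel_attention(view_feats, w1, w2)
```

In this sketch the spatial branch reweights positions within a single view, while the channel branch reweights whole views against each other. How the two resulting descriptors are fused into the global shape descriptor is not specified in the text above, so the sketch stops before that step.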

Highlights

  • The rapid development of 3D sensing technology has led to the development of depth cameras, laser scanners, depth scanners, and other 3D cameras and scanning equipment. 3D data acquisition is becoming increasingly convenient and accurate, promoting the continuous expansion of its application fields and scenes

  • Using ResNet50 as the base convolutional neural network (CNN) model, the proposed method achieves an overall accuracy of 96.6% and an average accuracy of 95.5% on the open-source ModelNet40 dataset compiled by Princeton University

  • Based on the above analysis, we propose a multi-view dual attention network (MVDAN), as shown in Fig. 1, based on a view space attention block (VSAB) and view channel attention block (VCAB)


Summary

Introduction

The rapid development of 3D sensing technology has led to the development of depth cameras, laser scanners, depth scanners, and other 3D cameras and scanning equipment. 3D data acquisition is becoming increasingly convenient and accurate, promoting the continuous expansion of its application fields and scenes. Compared with multi-camera setups, 3D sensor imaging devices such as depth cameras can capture a large amount of detailed 3D object structure information directly and conveniently [1]. Compared with other input representations, view-based representations are relatively low-dimensional, do not depend on complex 3D features, and represent 3D objects robustly. They take captured views as input and are not limited to 3D data. MVCNN [11] treats all views equally and uses max pooling, which preserves only the largest elements. This can result in a loss of information and ignores the content relationships between views and their distinguishability, which greatly limits the performance of view shape descriptors. Different views may not be effectively related, and the relative location information between views is ignored.
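The limitation attributed to max pooling can be seen in a small NumPy example. It is a toy illustration with made-up feature values, not MVCNN's actual code: element-wise max over the view axis keeps one winning value per feature dimension, discards every other view's response, and gives the same result for any ordering of the views.

```python
import numpy as np

def view_max_pool(view_feats):
    """Element-wise max over the view axis: (V, D) -> (D,)."""
    return view_feats.max(axis=0)

# Three hypothetical views, each with a 3-dimensional feature vector.
view_feats = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.1, 0.6, 0.3],
])

pooled = view_max_pool(view_feats)   # one surviving value per dimension
winners = view_feats.argmax(axis=0)  # which view supplied each surviving value

# Reordering the views does not change the pooled descriptor, so any
# information carried by the relative position of views is lost.
shuffled = view_feats[[2, 0, 1]]
order_invariant = np.array_equal(view_max_pool(shuffled), pooled)
```

Here the second view's strong response of 0.8 in the first dimension is dropped entirely because the first view scored 0.9, which is the kind of information loss the paragraph above describes.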
