Abstract

Recently, both single-modality and cross-modality near-duplicate image detection tasks have received wide attention in the pattern recognition and computer vision community. Existing deep neural network-based methods have achieved remarkable performance on this task. However, most methods focus on learning features of each image in the pair separately, and thus underuse the correlation information between near-duplicate image pairs. In this paper, to make more use of the correlations between image pairs, we propose a spatial transformer comparing convolutional neural network (CNN) model to compare near-duplicate image pairs. Specifically, we first propose a comparing CNN framework, which is equipped with a cross-stream to fully learn the correlation information between image pairs while still considering the features of each image. Furthermore, to deal with local deformations caused by cropping, translation, scaling, and non-rigid transformations, we additionally introduce a spatial transformer comparing CNN model by incorporating a spatial transformer module into the comparing CNN architecture. To demonstrate the effectiveness of the proposed method on both the single-modality and cross-modality (Optical-InfraRed) near-duplicate image pair detection tasks, we conduct extensive experiments on three popular benchmark datasets, namely CaliforniaND (ND means near duplicate), Mir-Flickr Near Duplicate, and TNO Multi-band Image Data Collection. The experimental results show that the proposed method achieves superior performance compared with many state-of-the-art methods on both tasks.
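The paper's exact layer configuration is not given in this summary, but the two ideas in the abstract (per-image streams plus a cross-stream over the image pair) can be illustrated with a minimal PyTorch sketch. The class name `ComparingCNN`, all layer sizes, and the concatenation-based fusion are illustrative assumptions, not the authors' architecture.

```python
# Hypothetical sketch of a two-branch "comparing CNN" with a cross-stream.
# Layer sizes and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn

class ComparingCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Shared per-image feature extractor (one stream per image in the pair).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        # Cross-stream: operates on the channel-concatenated pair so that
        # correlations between the two images are learned directly.
        self.cross_stream = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        feat = 64 * 4 * 4
        self.classifier = nn.Sequential(
            nn.Linear(feat * 3, 256), nn.ReLU(),
            nn.Linear(256, num_classes),  # near-duplicate vs. not
        )

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        fa = self.backbone(img_a)                                   # features of image A
        fb = self.backbone(img_b)                                   # features of image B
        fc = self.cross_stream(torch.cat([img_a, img_b], dim=1))    # pair-level features
        return self.classifier(torch.cat([fa, fb, fc], dim=1))

# Usage: logits = ComparingCNN()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```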

Highlights

  • With the rapid development of portable multimedia sensors and mobile Internet technologies, the amount of multimedia data has been growing rapidly

  • To tackle the spatial variations, we introduce a spatial transformer (ST) module into the model, termed the spatial transformer comparing CNN (ST-CCNN), to learn features that are robust to cropping, translation, scaling, and non-rigid transformations

  • We further propose the ST-CCNN model by introducing a spatial transformer module into the comparing convolutional neural network (CNN) architecture, which improves robustness to variations such as cropping, translation, scaling, and non-rigid transformations (a minimal sketch of such a module follows this list)
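The spatial transformer module referred to in the highlights follows the general design of Jaderberg et al.: a small localization network predicts an affine warp that is applied to the input before the comparing streams. The sketch below is a generic, assumed implementation (class name `SpatialTransformer`, layer sizes for a 64x64 input), not the authors' exact module.

```python
# Illustrative spatial transformer module: localization net -> affine grid -> sampling.
# All layer sizes are assumptions for a 64x64 RGB input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # Localization network: predicts 6 affine parameters from the image.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_theta = nn.Sequential(
            nn.LazyLinear(32), nn.ReLU(), nn.Linear(32, 6),
        )
        # Initialize to the identity transform so training starts undistorted.
        nn.init.zeros_(self.fc_theta[-1].weight)
        self.fc_theta[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        theta = self.fc_theta(self.loc(x)).view(-1, 2, 3)          # affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)         # warped image
```

In a comparing CNN, each input image would be passed through such a module before (or inside) its stream, letting the network undo cropping, translation, and scaling before the pair is compared.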


Summary

Introduction

With the rapid development of portable multimedia sensors and mobile Internet technologies, the amount of multimedia data has been growing rapidly. Conventional approaches based on hand-crafted local features, such as the widely adopted scale-invariant feature transform (SIFT) [5] and histograms of oriented gradients (HOG) [6] descriptors, obtain image-level features through aggregation strategies such as the vector of locally aggregated descriptors (VLAD) [7] and the Fisher Vector [8]. These methods suffer from complicated extraction steps and limited representation ability.
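To make the conventional pipeline concrete, the sketch below extracts SIFT descriptors with OpenCV and aggregates them into a VLAD vector against a pre-trained visual-word codebook (e.g., k-means centroids learned offline). Function names, the codebook size, and the normalization choices are assumptions for illustration, not a reproduction of the cited methods' exact settings.

```python
# Rough sketch of the hand-crafted pipeline: local SIFT descriptors -> VLAD aggregation.
import cv2
import numpy as np

def extract_sift(image_path: str) -> np.ndarray:
    """Return an (N, 128) array of SIFT descriptors for one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def vlad(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Aggregate local descriptors against a (K, 128) visual-word codebook."""
    # Assign each descriptor to its nearest visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    # Accumulate residuals (descriptor minus assigned centroid) per word.
    v = np.zeros_like(codebook)
    for k in range(codebook.shape[0]):
        members = descriptors[assignments == k]
        if len(members):
            v[k] = (members - codebook[k]).sum(axis=0)
    # Power- and L2-normalize the flattened vector, as is common for VLAD.
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + 1e-12)
```

The resulting fixed-length vector can then be compared across images with a simple distance, which is exactly the image-level representation step that the learned comparing CNN replaces.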


