Abstract

Given the complexity of multimodal environments and the inability of existing shallow network structures to achieve high-precision image-text retrieval, a cross-modal image-text retrieval method combining efficient feature extraction with an interactive-learning convolutional autoencoder (CAE) is proposed. First, the convolution kernels of a residual network are improved by incorporating two-dimensional principal component analysis (2DPCA) to extract image features, while text features are extracted with long short-term memory (LSTM) networks over word vectors, so that image and text features are obtained efficiently. Then, cross-modal retrieval of images and text is realized with the interactive-learning CAE: the image and text features are fed into the two inputs of the dual-modal CAE, and an image-text relationship model is learned through interaction in the middle layer to perform retrieval. Finally, the proposed method is evaluated on the Flickr30K, MSCOCO, and Pascal VOC 2007 datasets. The results show that the method performs accurate image retrieval and text retrieval: the mean average precision (MAP) exceeds 0.3, and the areas under the precision-recall (PR) curves are larger than those of the comparison methods, demonstrating its applicability.
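The dual-modal CAE described above can be pictured as two encoder branches joined by a shared middle layer through which the modalities interact. The sketch below is a minimal PyTorch illustration of that idea only; the layer sizes, the names `BimodalCAE`, `shared`, and `retrieval_loss`, and the cross-reconstruction objective are assumptions made for illustration, not the paper's exact architecture or loss.

```python
# Minimal sketch (PyTorch) of a dual-input convolutional autoencoder with a
# shared "interactive" middle layer. Sizes and the loss are illustrative.
import torch
import torch.nn as nn

class BimodalCAE(nn.Module):
    def __init__(self, text_dim=300, latent_dim=256):
        super().__init__()
        # Image branch: small convolutional encoder -> latent vector
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Text branch: encodes a precomputed LSTM sentence feature
        self.txt_enc = nn.Sequential(
            nn.Linear(text_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Shared middle layer where the two modalities interact
        self.shared = nn.Linear(latent_dim, latent_dim)
        # Decoders reconstruct each modality's feature from the shared code
        self.img_dec = nn.Linear(latent_dim, latent_dim)
        self.txt_dec = nn.Linear(latent_dim, text_dim)

    def forward(self, image, text_feat):
        z_img = self.shared(self.img_enc(image))
        z_txt = self.shared(self.txt_enc(text_feat))
        # Cross reconstruction: each code tries to explain the other modality,
        # one simple way to realise "interactive learning" in the middle layer.
        txt_from_img = self.txt_dec(z_img)
        img_from_txt = self.img_dec(z_txt)
        return z_img, z_txt, txt_from_img, img_from_txt

def retrieval_loss(model, image, text_feat):
    z_img, z_txt, txt_hat, img_hat = model(image, text_feat)
    mse = nn.functional.mse_loss
    # Align the shared codes and reconstruct across modalities.
    return (mse(z_img, z_txt)
            + mse(txt_hat, text_feat)
            + mse(img_hat, model.img_enc(image).detach()))
```

Under these assumptions, retrieval would rank candidates by the distance between the shared codes `z_img` and `z_txt`, searching text by an image query or images by a text query with the same learned space.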

Highlights

  • With the advancement of digitalization, more and more people use the Internet to obtain the information they need

  • To highlight the effectiveness of the interactive-learning convolutional autoencoder (CAE) model proposed in this paper, it is compared with other CAE-based methods, such as the retrieval method based on a multimodal semantic autoencoder (SCAE) proposed in [28]

  • A cross-modal image-text retrieval method combining efficient feature extraction and an interactive-learning CAE is proposed. The residual network convolution kernel is improved by incorporating 2DPCA to extract image features, and text features are extracted through long short-term memory (LSTM) networks over word vectors, yielding both image and text features (a sketch of this feature-extraction front end follows the highlights)

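As a rough illustration of the feature-extraction step referenced in the last highlight, the sketch below computes 2DPCA projection directions from image matrices and extracts a sentence feature with an LSTM over word vectors. The patch handling, feature sizes, and the exact way 2DPCA is combined with the residual network's convolution kernels are assumptions for illustration; the functions `twodpca_directions`, `twodpca_features`, and the class `TextEncoder` are hypothetical names, not the paper's code.

```python
# Minimal sketch of the two feature extractors: 2DPCA directions learned from
# image matrices, and an LSTM sentence encoder over word vectors.
import numpy as np
import torch
import torch.nn as nn

def twodpca_directions(images, num_components=8):
    """images: array (M, h, w); returns a (w, num_components) projection matrix."""
    images = np.asarray(images, dtype=np.float64)
    centered = images - images.mean(axis=0)
    # Image scatter matrix G = (1/M) * sum_i (A_i - mean)^T (A_i - mean)
    scatter = np.einsum('mhw,mhv->wv', centered, centered) / len(images)
    _, eigvecs = np.linalg.eigh(scatter)          # eigenvalues ascending
    return eigvecs[:, ::-1][:, :num_components]   # leading directions

def twodpca_features(image, directions):
    """Project one (h, w) image onto the 2DPCA directions -> (h, num_components)."""
    return image @ directions

class TextEncoder(nn.Module):
    """LSTM over pre-trained word vectors; the final hidden state is the text feature."""
    def __init__(self, word_dim=300, hidden_dim=300):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)

    def forward(self, word_vectors):          # (batch, seq_len, word_dim)
        _, (h_n, _) = self.lstm(word_vectors)
        return h_n[-1]                        # (batch, hidden_dim)
```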

Introduction

With the advancement of digitalization, more and more people use the Internet to obtain the information they need. Many retrieval methods exist, but most operate on a single modality, such as searching for articles by text or searching for pictures with pictures; even ostensibly multimodal search usually takes the form of keyword queries that request the best-matching content among the many resources on the Internet. How to mine the effective information in such multimodal data is an important problem in cross-modal retrieval research. The core of cross-modal retrieval is therefore to mine the associated information between data of different modalities. Because information from different modalities is diverse and heterogeneous, the feature extraction method and the unified representation of each modality become the key to solving the problem [9, 10]. Corpora in which images and text are aligned across modalities are the most common.

Related Research
Method Framework
Improved Image Feature Extraction of Convolution Kernel
Retrieval Results
Cross-Modal Convolutional Autoencoder
Experiment and Analysis
Methods
Conclusion