Abstract

Visual-semantic embedding maps images and text into a common embedding space and learns cross-modal semantic alignment; image-text matching is its main research task. Existing research has confirmed that a simple pooling strategy can achieve good performance in visual-semantic embedding. However, existing visual-semantic pooling strategies (aggregators) generally have shortcomings, including introducing additional training parameters, increasing training time, and ignoring intra-modal semantic correlation. In this paper, we propose a Super Visual Semantic Embedding (SVSE) model based on softmax pooling (SoftPool). We introduce the softmax pooling strategy into visual-semantic embedding for the first time. SoftPool is simple to implement and introduces no additional training parameters. It adaptively computes weights over the feature values and preserves more intra-modal correlation information between features. We further combine an enhanced semantic representation module with our softmax pooling strategy to construct intra-modal semantic associations, which improves the performance of visual-semantic embedding in image-text matching. Because it is simple and parameter-free, the proposed method also has high value for engineering applications. Experiments are conducted on two widely used cross-modal image-text datasets, MS-COCO and Flickr-30K. Compared with the best existing pooling strategy, our softmax pooling strategy requires less training time and improves R@1 for image retrieval by 0.48% on MS-COCO (5K) and 1.95% on Flickr-30K. Moreover, compared with the best existing visual-semantic embedding model, SVSE improves R@1 for image retrieval by 2.83% on MS-COCO (5K) and 4.89% on Flickr-30K (1K).
Our code is available at https://github.com/zengzhixian/SoftPool_SVSE.git.
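
To make the idea of parameter-free softmax pooling concrete, the following is a minimal sketch and not the authors' released implementation. It assumes that pooling is applied independently to each embedding dimension over a set of local features (for example image regions or word tokens), and that the weights are a softmax of the feature values themselves. The function name `softmax_pool` and the `temperature` argument are hypothetical and introduced only for illustration.

```python
import math


def softmax_pool(features, temperature=1.0):
    """Pool a list of n feature vectors (each of length d) into one vector.

    Assumption for illustration: each dimension is pooled separately, and
    the weight of a value is its softmax score within that dimension, so
    no trainable parameters are involved.
    """
    if not features:
        raise ValueError("features must contain at least one vector")
    dim = len(features[0])
    pooled = []
    for j in range(dim):
        column = [row[j] for row in features]
        # Subtract the column maximum before exponentiating for stability.
        peak = max(column)
        exps = [math.exp((v - peak) / temperature) for v in column]
        total = sum(exps)
        # Softmax-weighted average of the values in this dimension.
        pooled.append(sum(e * v for e, v in zip(exps, column)) / total)
    return pooled


# Hypothetical usage: three local features with two dimensions each.
region_features = [[1.0, 0.5], [3.0, 0.2], [2.0, 0.9]]
image_embedding = softmax_pool(region_features)
```

Under these assumptions the pooled value in each dimension lies between the mean and the maximum of that dimension: larger activations receive larger weights, while smaller ones still contribute, unlike max pooling.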
