Abstract

The good generalization performance of conventional pattern classifiers often relies on large amounts of training data labeled by costly human labor. Publicly available web resources are growing explosively, which allows us to easily obtain abundant and cheap web data. Yet, web data are usually noisier and less reliable than human-labeled data. In this paper, we explore the use of web text data to aid image classification. Rather than requiring the prior collection of auxiliary data from the web, we directly retrieve web text information with the aid of a powerful reverse image search engine. We develop a novel textual modeling method named the semantic matching neural network (SMNN), which learns semantic features from the text associated with web images. The SMNN text features offer improved reliability and applicability compared to text features obtained by other methods. The SMNN text features and convolutional neural network (CNN) visual features are merged into a shared representation that learns to capture the correlations between the two modalities. Experimental results on the benchmark UIUC-Sports, Scene-15, Caltech-256, and Pascal VOC-2012 data sets show that the visual and text modalities of data from different sources are remarkably complementary, and fusing them achieves substantial performance improvements.
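For illustration only, the sketch below shows one plausible way to merge visual and text features into a shared representation with a classifier on top, in PyTorch. The feature dimensions, layer sizes, projection-plus-summation fusion, and all names are assumptions made for this sketch; they are not the paper's actual SMNN or fusion architecture.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Hypothetical fusion of CNN visual features and text features into a
    shared representation. Dimensions and fusion choice are assumptions."""
    def __init__(self, visual_dim=4096, text_dim=300, shared_dim=512, num_classes=15):
        super().__init__()
        # Project each modality into a common space.
        self.visual_proj = nn.Linear(visual_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        # Classifier over the shared (fused) representation.
        self.classifier = nn.Linear(shared_dim, num_classes)

    def forward(self, visual_feat, text_feat):
        # Element-wise sum after projection is one simple fusion choice;
        # concatenating the projected features is a common alternative.
        shared = torch.tanh(self.visual_proj(visual_feat) + self.text_proj(text_feat))
        return self.classifier(shared)

# Usage with random stand-ins for CNN visual and SMNN-style text features.
model = MultimodalFusion()
visual = torch.randn(8, 4096)   # e.g., CNN activations (assumed dimension)
text = torch.randn(8, 300)      # e.g., text embeddings (assumed dimension)
logits = model(visual, text)    # shape: (8, num_classes)
```

Training such a model jointly lets the shared layer learn cross-modal correlations, which is the general idea the abstract attributes to the fused representation.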
