Abstract

As automated image analysis progresses, there is increasing interest in richer linguistic annotation of pictures, with attributes of objects (e.g., furry, brown…) attracting most attention. By building on the recent “zero-shot learning” approach, and paying attention to the linguistic nature of attributes as noun modifiers, and specifically adjectives, we show that it is possible to tag images with attribute-denoting adjectives even when no training data containing the relevant annotation are available. Our approach relies on two key observations. First, objects can be seen as bundles of attributes, typically expressed as adjectival modifiers (a dog is something furry, brown, etc.), and thus a function trained to map visual representations of objects to nominal labels can implicitly learn to map attributes to adjectives. Second, objects and attributes come together in pictures (the same thing is a dog and it is brown). We can thus achieve better attribute (and object) label retrieval by treating images as “visual phrases”, and decomposing their linguistic representation into an attribute-denoting adjective and an object-denoting noun. Our approach performs comparably to a method exploiting manual attribute annotation, outperforms various competitive alternatives in both attribute and object annotation, and automatically constructs attribute-centric representations that significantly improve performance in supervised object recognition.
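
The cross-modal mapping at the heart of this approach is not spelled out in the abstract; the sketch below illustrates one plausible instantiation, assuming a linear (ridge) regression from visual feature vectors to distributional word vectors, followed by nearest-neighbor label retrieval. All data, dimensionalities, and vocabulary names are hypothetical placeholders, not the authors' exact setup.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical training data: each image is paired with the
# distributional (corpus-derived) vector of its object label.
rng = np.random.default_rng(0)
visual_train = rng.standard_normal((500, 1000))  # e.g., PHOW descriptors
word_train = rng.standard_normal((500, 300))     # e.g., word vectors of nouns

# Learn a linear cross-modal map from visual to linguistic space.
# Ridge regression is one reasonable regressor; the paper may differ.
mapper = Ridge(alpha=1.0).fit(visual_train, word_train)

# Project an unseen image into linguistic space and rank vocabulary
# words by cosine similarity (zero-shot labeling: no attribute-annotated
# training images are needed).
vocab_words = ["dog", "furry", "car", "red"]        # toy vocabulary
vocab_vecs = rng.standard_normal((4, 300))

test_image = rng.standard_normal((1, 1000))
proj = mapper.predict(test_image)[0]
sims = vocab_vecs @ proj / (np.linalg.norm(vocab_vecs, axis=1)
                            * np.linalg.norm(proj) + 1e-9)
print([vocab_words[i] for i in np.argsort(-sims)])
```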

Highlights

  • As the quality of image analysis algorithms improves, there is increasing interest in annotating images with linguistic descriptions, ranging from single words describing the depicted objects and their properties (Farhadi et al., 2009; Lampert et al., 2009) to richer expressions such as full-fledged image captions (Kulkarni et al., 2011; Mitchell et al., 2012).

  • Russakovsky and Fei-Fei (2010) trained separate SVM classifiers for each attribute in the evaluation dataset in a cross-validation setting. This fully supervised approach can be seen as an ambitious upper bound for zero-shot learning, and we directly compare our performance to theirs using their figure of merit, namely area under the ROC curve (AUC), which is commonly used for binary classification problems (a minimal sketch of this protocol follows these highlights).

  • The combined FUSED approach outperforms both representations by a large margin (35.81%), confirming that the linguistically-enriched information brought by DEC is to a certain extent complementary to the lower-level visual evidence directly exploited by PHOW (see the fusion step in the sketch below).
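
The second and third highlights refer, respectively, to a supervised evaluation protocol (one binary SVM per attribute, scored by AUC under cross-validation) and to combining the DEC and PHOW representations. The sketch below illustrates both under stated assumptions: fusion is modeled as plain feature concatenation (one plausible scheme, not necessarily the authors' exact one), and all feature matrices and labels are randomly generated placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

# Hypothetical features: low-level PHOW visual descriptors and DEC
# (decomposed linguistic) representations for the same n images.
rng = np.random.default_rng(0)
n = 400
phow = rng.standard_normal((n, 600))
dec = rng.standard_normal((n, 300))
has_attribute = rng.integers(0, 2, n)  # binary label for one attribute

# FUSED: simple concatenation of the two representations.
fused = np.hstack([phow, dec])

# One binary SVM per attribute, scored by AUC under cross-validation,
# mirroring the Russakovsky and Fei-Fei (2010) protocol.
scores = cross_val_predict(LinearSVC(), fused, has_attribute,
                           cv=5, method="decision_function")
print("AUC:", roc_auc_score(has_attribute, scores))
```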

Summary

Introduction

As the quality of image analysis algorithms improves, there is increasing interest in annotating images with linguistic descriptions, ranging from single words describing the depicted objects and their properties (Farhadi et al., 2009; Lampert et al., 2009) to richer expressions such as full-fledged image captions (Kulkarni et al., 2011; Mitchell et al., 2012). While this correlation is smaller than for object-noun data (0.23), we conjecture that it is sufficient for zero-shot learning of attributes. We confirm this by testing a cross-modal projection function from attributes, such as colors and shapes, onto adjectives in linguistic semantic space, trained on pre-existing annotated datasets covering fewer than 100 attributes (Experiment 1). We then turn to recent work in distributional semantics defining a vector decomposition framework (Dinu and Baroni, 2014) which, given a vector encoding the meaning of a phrase, aims at decoupling its constituents, producing vectors that can be matched to a sequence of words that best captures the semantics of the phrase. We adopt this framework to decompose image representations projected onto linguistic space into an adjective-noun phrase (a minimal sketch follows below). In addition to contributing to image annotation, our work suggests new test beds for distributional semantic representations of nouns and associated adjectives, and provides more in-depth evidence of the potential of the decompositional approach.
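
To make the decomposition step concrete, here is a minimal sketch assuming a weighted-additive composition model, i.e., a phrase vector is approximated as phrase ≈ alpha·adj + beta·noun, so a projected image vector can be split into a candidate adjective and noun by searching small vocabularies. This additive model with fixed weights is a simplifying assumption for illustration; Dinu and Baroni (2014) learn their (de)composition functions from corpus data, and all vocabularies below are toy placeholders.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def decompose(phrase_vec, noun_vecs, adj_vecs, alpha=0.5, beta=0.5):
    """Split a phrase vector into its best adjective-noun pair,
    assuming additive composition: phrase ~ alpha*adj + beta*noun."""
    best = None
    for noun, nv in noun_vecs.items():
        # Solve for the adjective vector implied by this noun choice.
        adj_estimate = (phrase_vec - beta * nv) / alpha
        adj, sim = max(((a, cosine(adj_estimate, av))
                        for a, av in adj_vecs.items()), key=lambda x: x[1])
        # Heuristic score: adjective fit plus noun fit to the phrase.
        total = sim + cosine(nv, phrase_vec)
        if best is None or total > best[0]:
            best = (total, adj, noun)
    return best[1], best[2]

# Hypothetical tiny vocabularies of 300-d word vectors.
rng = np.random.default_rng(0)
nouns = {w: rng.standard_normal(300) for w in ["dog", "car", "apple"]}
adjs = {w: rng.standard_normal(300) for w in ["furry", "red", "metallic"]}
phrase = 0.5 * adjs["furry"] + 0.5 * nouns["dog"]
print(decompose(phrase, nouns, adjs))  # expected: ('furry', 'dog')
```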

Cross-Modal Mapping
Decomposition
Representational Spaces
Evaluation Dataset
Experiment 1
Cross-modal training and evaluation
Results and discussion
Experiment 2
Cross-modal training
Object-agnostic models
Object-informed models
Results
Using DEC for attribute-based object classification
Conclusion