Improving scene attribute recognition using web-scale object detectors

Frederick Tung, James J Little

doi:10.1016/j.cviu.2015.05.012

Abstract

Semantic attributes enable a richer description of scenes than basic category labels. While traditionally scenes have been analyzed using global image features such as Gist, recent studies suggest that humans often describe scenes in ways that are naturally characterized by local image evidence. For example, humans often describe scenes by their functions or affordances, which are largely suggested by the objects in the scene. In this paper, we leverage a large collection of modern object detectors trained at the web scale to derive effective high-level features for scene attribute recognition. We conduct experiments using two modern object detection frameworks: a semi-supervised learner that continuously learns object models from web images, and a state-of-the-art deep network. The detector response features improve the state of the art on the standard scene attribute benchmark by 5% average precision, and also capture intuitive object-scene relationships, such as the positive correlation of castles with “vacationing/touring” scenes.
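As a rough illustration of the idea summarized above, the following minimal sketch pools the outputs of a bank of object detectors into one fixed-length feature vector per image and scores a ranked list of attribute predictions by average precision. The details are assumptions made for illustration and are not taken from the paper: the function names are hypothetical, detection scores are assumed to lie in [0, 1], max-pooling is only one possible way to aggregate responses, and the average precision shown is the plain non-interpolated variant, which may differ from the benchmark's evaluation protocol.

```python
from typing import List, Sequence, Tuple


def detector_response_features(
    detections: Sequence[Tuple[int, float]], num_detectors: int
) -> List[float]:
    """Max-pool detection scores into one response value per detector.

    `detections` holds (detector_id, score) pairs for a single image.
    Detectors that never fire keep a response of 0.0 (assumes scores in [0, 1]).
    """
    features = [0.0] * num_detectors
    for detector_id, score in detections:
        if score > features[detector_id]:
            features[detector_id] = score
    return features


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated average precision for one attribute.

    `scores` are classifier confidences, `labels` are 1 (attribute present)
    or 0 (absent). Returns 0.0 when there are no positive examples.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Hypothetical image with three detections from a bank of three detectors.
image_features = detector_response_features([(0, 0.2), (2, 0.9), (0, 0.7)], 3)

# Hypothetical attribute scores for three images, two of which are positive.
attribute_ap = average_precision([0.9, 0.8, 0.1], [1, 0, 1])
```

In a full pipeline, one such feature vector per image would be fed to a per-attribute classifier, and the learned weights could then be inspected for object-scene relationships of the kind the abstract mentions.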

Full Text