Abstract

The semantic gap, i.e., the failure of low-level hand-crafted visual features to encode the high-level semantic concepts contained in images, has long been a challenging issue in image retrieval and significantly impairs the performance of real-world retrieval systems. Despite the massive effort devoted to developing effective image signatures, e.g., the Bag of Visual Words (BoVW), the Fisher Vector (FV), and the Vector of Locally Aggregated Descriptors (VLAD), these mid-level image features still fail to close the semantic gap and thus lead to suboptimal results. To this end, a large body of work has introduced attribute learning into a variety of vision applications. Because attributes describe intrinsic properties of objects, such as color, shape, and rigidity, learned attributes serve as intermediate representations that help bridge the semantic gap. However, conventional attribute embedding methods generally produce a global image representation and ignore local spatial cues, which prevents them from achieving desirable performance. In this paper, we attempt to encode weak spatial information into attribute embedding for effective image retrieval. Specifically, we partition the image into regular grids and extract a Classemes attribute vector from each patch, yielding a large pool of Classemes descriptors that are then aggregated with VLAD into a holistic representation. To produce a compact and discriminative code, we apply piecewise Fisher Discriminant Analysis (FDA) for dimensionality reduction and concatenate the compressed pieces into a single vector, coined Spatially Pooled Attributes (SPA). Thorough experimental evaluation and comparative studies on three public benchmarks demonstrate the superiority of the proposed approach.

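The pipeline summarized above (regular-grid partitioning, per-patch Classemes attributes, VLAD aggregation, piecewise FDA compression) can be illustrated with the following minimal sketch. It is not the paper's implementation: classemes_descriptor is a hypothetical placeholder for the real bank of Classemes attribute classifiers, and the grid size, codebook size, and piece settings are illustrative assumptions rather than the reported configuration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def classemes_descriptor(patch, dim=128):
    """Placeholder for the Classemes attribute classifier bank (assumed stub)."""
    rng = np.random.default_rng(int(patch.sum() * 1e6) % (2 ** 32))
    return rng.standard_normal(dim)


def grid_patches(image, grid=(4, 4)):
    """Partition an H x W x C image into regular, non-overlapping grid cells."""
    h, w = image.shape[:2]
    gh, gw = h // grid[0], w // grid[1]
    return [image[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            for i in range(grid[0]) for j in range(grid[1])]


def vlad_aggregate(descriptors, codebook):
    """Sum residuals to the nearest codeword, then power- and L2-normalize."""
    assignments = codebook.predict(descriptors)
    centers = codebook.cluster_centers_
    vlad = np.zeros_like(centers)
    for desc, a in zip(descriptors, assignments):
        vlad[a] += desc - centers[a]
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    return vlad / (np.linalg.norm(vlad) + 1e-12)


def spa_encode(image, codebook):
    """Image -> grid patches -> per-patch Classemes -> VLAD aggregation."""
    descs = np.vstack([classemes_descriptor(p) for p in grid_patches(image)])
    return vlad_aggregate(descs, codebook)


def piecewise_fda(codes, labels, n_pieces=4, dims_per_piece=16):
    """Compress each contiguous segment with FDA (LDA here) and concatenate."""
    compressed = []
    for piece in np.array_split(codes, n_pieces, axis=1):
        n_comp = min(dims_per_piece, piece.shape[1], len(np.unique(labels)) - 1)
        lda = LinearDiscriminantAnalysis(n_components=n_comp)
        compressed.append(lda.fit_transform(piece, labels))
    return np.hstack(compressed)


if __name__ == "__main__":
    # Toy usage on random images; real training would use labelled image data.
    rng = np.random.default_rng(0)
    images = [rng.random((64, 64, 3)) for _ in range(6)]
    labels = np.array([0, 0, 0, 1, 1, 1])
    train_descs = np.vstack([classemes_descriptor(p)
                             for img in images for p in grid_patches(img)])
    codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train_descs)
    spa_codes = np.vstack([spa_encode(img, codebook) for img in images])
    print(piecewise_fda(spa_codes, labels).shape)

In a realistic setting, the per-piece FDA projections would be learned once on a labelled training set and then applied to database and query images before concatenation into the final SPA code.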