Abstract

The object in an image carries the main information for image classification. When the background is complex or the object is small, existing invariant features such as Scale-Invariant Feature Transform (SIFT) or Speeded-Up Robust Features (SURF) are difficult to use for object-level representation: because SIFT cannot distinguish whether a feature contains relevant object information, the resulting representation may consist of background or otherwise uninformative features. We instead use Detection Transformer (DETR), a state-of-the-art object detector, to represent object-level information. By visualizing the attention maps of the Transformer decoder, we find that each output vector effectively indicates an object region. Bag of Visual Words (BoVW) is then applied to represent the N output vectors of DETR as the feature of a query image. On scene-level and object-level datasets, we compare our method with SIFT-based BoVW on an image classification task, and show that the proposed method performs better than SIFT-based BoVW on the object-level dataset.
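The BoVW step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the decoder outputs and the codebook are random placeholders (in the paper, the N vectors come from the DETR Transformer decoder, and a visual vocabulary would be fit, e.g. with k-means, on training-set vectors). The names `bovw_histogram`, `N_QUERIES`, `DIM`, and `K` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for DETR decoder outputs: N object-query vectors of dimension D
# per image. (Placeholder data; the paper takes these from the decoder.)
N_QUERIES, DIM, K = 100, 256, 32

# Codebook of K visual words. Random here; in practice it would be fit
# with k-means on decoder output vectors from the training set.
codebook = rng.normal(size=(K, DIM))

def bovw_histogram(vectors: np.ndarray) -> np.ndarray:
    """Assign each vector to its nearest visual word and return an
    L1-normalised K-bin histogram, used as the image-level feature."""
    # Squared Euclidean distance from every vector to every codeword.
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()

# One query image's N decoder vectors -> one K-dimensional feature.
query_feats = rng.normal(size=(N_QUERIES, DIM))
h = bovw_histogram(query_feats)
print(h.shape)  # (32,)
```

The resulting histogram can then be fed to any standard classifier, exactly as in SIFT-based BoVW, with the only change being the source of the local descriptors.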
