Abstract

In recent years, the rapid development of convolutional neural networks (CNNs) has enabled strong progress on a range of computer vision tasks. However, scene recognition remains a difficult and challenging problem due to the complexity of scene images. With the emergence of large-scale scene datasets, a single representation generated by a plain CNN is no longer discriminative enough to describe massive scene images. Therefore, in this paper, we propose a comprehensive representation for scene recognition that comprises an enhanced global scene representation, a local salient scene representation, and a local contextual object representation. We use two pretrained CNNs to extract the original feature maps from which the multiple representations are constructed. Specifically, we adopt class activation mapping (CAM) to locate salient regions and extract local scene features, and we employ a bidirectional long short-term memory (LSTM) module to encode the contextual information of the objects present in a scene. The multiple representations are generated by an end-to-end trainable model, which we call MRNet (multiple representation network). Experimental results on three publicly available scene recognition datasets demonstrate that our proposed model outperforms state-of-the-art models.
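The abstract mentions using class activation mapping (CAM) to find salient regions. As a rough illustration of that step only, the sketch below shows the standard CAM computation: weighting the final convolutional feature maps by the classifier weights for a target class, then thresholding the map to obtain a salient bounding box. This is a minimal NumPy sketch of generic CAM, not the authors' MRNet implementation; the shapes, threshold, and helper names are assumptions for illustration.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Generic CAM: weight conv feature maps by classifier weights.

    feature_maps: (C, H, W) activations from the last conv layer.
    fc_weights:   (num_classes, C) weights of the linear classifier
                  applied after global average pooling.
    Returns a (H, W) map normalised to [0, 1].
    """
    w = fc_weights[class_idx]                     # (C,)
    cam = np.tensordot(w, feature_maps, axes=1)   # weighted sum -> (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

def salient_bbox(cam, thresh=0.5):
    """Bounding box (x0, y0, x1, y1) of the region where CAM >= thresh."""
    ys, xs = np.where(cam >= thresh)
    return xs.min(), ys.min(), xs.max(), ys.max()

# Toy example: one channel activates on a small patch.
fm = np.zeros((3, 8, 8))
fm[0, 2:4, 2:4] = 1.0
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
cam = class_activation_map(fm, W, class_idx=0)
box = salient_bbox(cam)  # the activated patch -> (2, 2, 3, 3)
```

In a pipeline like the one the abstract describes, the box would then be used to crop the salient region of the input image, from which local scene features are extracted.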
