Abstract

Accurately recognizing different categories of sceneries with sophisticated spatial configurations is a useful technique in computer vision and intelligent systems, e.g., scene understanding and autonomous driving. Competitive accuracies have been observed by the deep recognition models recently. Nevertheless, these deep architectures cannot explicitly characterize human visual perception, that is, the sequence of gaze allocation and the subsequent cognitive processes when viewing each scenery. In this paper, a novel spatially aware aggregation network is proposed for scene categorization, where the human gaze behavior is discovered in a semisupervised setting. In particular, as semantically labeling a large quantity of scene images is labor-intensive, a semisupervised and structure-preserved non-negative matrix factorization (NMF) is proposed to detect a set of visually/semantically salient regions from each scenery. Afterward, the gaze shifting path (GSP) is engineered to characterize the process of humans perceiving each scene picture. To deeply describe each GSP, a novel spatially aware CNN termed SA-Net is developed. It accepts input regions with various shapes and statistically aggregates all the salient regions along each GSP. Finally, the learned deep GSP features from the entire scene images are fused into an image kernel, which is subsequently integrated into a kernel SVM to categorize different sceneries. Comparative experiments on six scene image sets have shown the advantage of our method.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call