Abstract

Image scene geometry recognition is an important element in reconstructing the 3D scene geometry of a single image. It is useful for computer vision applications such as 3D TV, video categorization, and robot navigation systems. A 3D scene geometry with a unique depth represents the rough structure of a 2D image. Achieving both an efficient implementation and high recognition accuracy for 3D scene geometry remains a significant challenge in the computer vision domain. Existing approaches use pre-trained deep convolutional neural network (CNN) models as feature extractors and explore the benefits of multi-layer feature representations for small or medium-sized datasets. However, these studies pay little attention to building a discriminative feature representation by fusing low-level features with the multi-layer features of a single CNN model. To address this problem, we propose a novel image scene geometry recognition model in which low-level handcrafted features are integrated with deep CNN multi-stage features (HF-MSF) using feature-fusion and score-level fusion strategies. The low-level features contain rich discriminative information about 3D scene geometry, including shape, color, and depth estimation. In feature-fusion, the multi-stage features and the handcrafted features are fused at an early phase. In score-level fusion, the handcrafted features are integrated with the multi-layer features of a single CNN model at each stage, each stage is connected to a classifier, and the scores of these classifiers are then fused to recognize the scene geometry type. For validation and comparison purposes, two well-known deep learning architectures, GoogLeNet and ResNet, are employed as the backbone of the proposed model. Experimental results show that, by taking advantage of both types of fusion, the proposed HF-MSF model improves recognition accuracy by 12.21% and 4.96% over the G-MS2F model on the 12-Scene and 15-Scene image datasets, respectively. Similarly, it improves accuracy by 3.85% over the FTOTLM model on the 15-Scene dataset.
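To make the two fusion strategies concrete, the following is a minimal sketch (not the authors' released code) of how multi-stage CNN features, handcrafted features, per-stage classifiers, and score-level fusion could fit together. The class name `HFMSFSketch`, the stage split of a ResNet-18 backbone, the handcrafted feature dimension, and the simple score-averaging rule are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of HF-MSF-style fusion: early feature fusion plus
# per-stage classifiers whose scores are fused. All design details
# below are assumptions for illustration only.
import torch
import torch.nn as nn
import torchvision.models as models


class HFMSFSketch(nn.Module):
    def __init__(self, num_classes: int, handcrafted_dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)  # ResNet backbone, one of the two used in the paper
        # Split the backbone into multi-stage feature extractors.
        self.stage1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                    backbone.maxpool, backbone.layer1)
        self.stage2 = backbone.layer2
        self.stage3 = backbone.layer3
        self.stage4 = backbone.layer4
        self.pool = nn.AdaptiveAvgPool2d(1)

        stage_dims = [64, 128, 256, 512]  # channel widths of the ResNet-18 stages
        # Score-level fusion branch: one classifier per stage, fed with the
        # stage's pooled CNN features concatenated with the handcrafted descriptor.
        self.stage_classifiers = nn.ModuleList(
            [nn.Linear(d + handcrafted_dim, num_classes) for d in stage_dims]
        )
        # Feature-fusion branch: one classifier over all stages plus handcrafted features.
        self.fusion_classifier = nn.Linear(sum(stage_dims) + handcrafted_dim, num_classes)

    def forward(self, image: torch.Tensor, handcrafted: torch.Tensor) -> torch.Tensor:
        feats = []
        x = image
        for stage in (self.stage1, self.stage2, self.stage3, self.stage4):
            x = stage(x)
            feats.append(self.pool(x).flatten(1))

        # Early (feature-level) fusion of all stages with the handcrafted features.
        fused = torch.cat(feats + [handcrafted], dim=1)
        scores = [self.fusion_classifier(fused)]

        # Per-stage classifiers followed by score-level fusion (plain averaging here).
        for f, clf in zip(feats, self.stage_classifiers):
            scores.append(clf(torch.cat([f, handcrafted], dim=1)))
        return torch.stack(scores).mean(dim=0)


if __name__ == "__main__":
    model = HFMSFSketch(num_classes=15)  # e.g. the 15-Scene dataset
    logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 128))
    print(logits.shape)  # torch.Size([2, 15])
```

In this sketch the handcrafted descriptor is simply concatenated at each fusion point; the paper's actual choice of handcrafted features (shape, color, and depth cues) and its score-combination rule may differ.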
