Abstract

To address the classification of indoor scene images, a multi-modal fusion model is proposed. First, single-modal classification models are built from the scene image and its semantic description. For scene images, a convolutional neural network is used to extract features and train a classification model. For scene semantic descriptions, a recurrent neural network extracts text features; a scene feature space is then constructed, and the semantic descriptions are embedded into it to obtain classification results. Second, the two kinds of single-modal features are reduced in dimensionality, fused, and fed to a deep neural network to train a feature-level fusion model. Finally, the two single-modal models and the feature-level fusion model are assigned different weights to construct a hybrid fusion model, and the weights are adjusted iteratively to achieve the best classification accuracy.
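The final weighting step can be illustrated with a minimal sketch. The abstract does not specify how the weights are adjusted, so the grid search below, the function names, and the use of class-probability matrices are all illustrative assumptions, not the paper's actual procedure:

```python
import numpy as np

def hybrid_predict(p_img, p_txt, p_fus, weights):
    # Weighted sum of the three models' class-probability matrices
    # (rows = samples, columns = classes).
    w_img, w_txt, w_fus = weights
    return w_img * p_img + w_txt * p_txt + w_fus * p_fus

def search_weights(p_img, p_txt, p_fus, labels, step=0.1):
    # Hypothetical weight-tuning loop: exhaustively try weight triples
    # that sum to 1 and keep the combination with the highest
    # validation accuracy.
    best_w, best_acc = (1.0, 0.0, 0.0), 0.0
    for w_img in np.arange(0.0, 1.0 + 1e-9, step):
        for w_txt in np.arange(0.0, 1.0 - w_img + 1e-9, step):
            w_fus = 1.0 - w_img - w_txt
            probs = hybrid_predict(p_img, p_txt, p_fus, (w_img, w_txt, w_fus))
            acc = (probs.argmax(axis=1) == labels).mean()
            if acc > best_acc:
                best_acc, best_w = acc, (w_img, w_txt, w_fus)
    return best_w, best_acc
```

In practice the probability matrices would come from the image model, the text model, and the feature-level fusion model evaluated on a held-out validation set, and the chosen weights would then be fixed for test-time prediction.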
