Abstract

To address the classification of indoor scene images, a multi-modal fusion model is proposed. First, single-modal classification models are built from the scene image and its semantic description. For scene images, a convolutional neural network is used to extract features and train a classification model. For scene semantic descriptions, a recurrent neural network extracts text features; a scene feature space is then constructed, and the semantic descriptions are embedded into it to obtain classification results. Second, the two kinds of single-modal features are reduced in dimensionality, fused, and fed to a deep neural network to train a feature-level fusion model. Finally, the two single-modal models and the feature-level fusion model are assigned different weights to construct a hybrid fusion model, and the weights are adjusted iteratively to achieve the best classification accuracy.
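The final weighting step can be illustrated with a minimal sketch. The abstract does not specify how the weights are adjusted, so the grid search below, the function names, and the use of class-probability matrices are all illustrative assumptions, not the paper's actual procedure:

```python
import numpy as np

def hybrid_predict(p_img, p_txt, p_fus, weights):
    # Weighted sum of the three models' class-probability matrices
    # (rows = samples, columns = classes).
    w_img, w_txt, w_fus = weights
    return w_img * p_img + w_txt * p_txt + w_fus * p_fus

def search_weights(p_img, p_txt, p_fus, labels, step=0.1):
    # Hypothetical weight-tuning loop: exhaustively try weight triples
    # that sum to 1 and keep the combination with the highest
    # validation accuracy.
    best_w, best_acc = (1.0, 0.0, 0.0), 0.0
    for w_img in np.arange(0.0, 1.0 + 1e-9, step):
        for w_txt in np.arange(0.0, 1.0 - w_img + 1e-9, step):
            w_fus = 1.0 - w_img - w_txt
            probs = hybrid_predict(p_img, p_txt, p_fus, (w_img, w_txt, w_fus))
            acc = (probs.argmax(axis=1) == labels).mean()
            if acc > best_acc:
                best_acc, best_w = acc, (w_img, w_txt, w_fus)
    return best_w, best_acc
```

In practice the probability matrices would come from the image model, the text model, and the feature-level fusion model evaluated on a held-out validation set, and the chosen weights would then be fixed for test-time prediction.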
