Abstract

The performance of visual classification models across IoT devices is usually limited by changes in local environments, resulting from the diverse appearances of target objects and differences in lighting conditions and background scenes. To alleviate these problems, existing studies usually introduce multimodal information to guide the learning of visual classification models, so that the models extract visual features from discriminative image regions. In particular, cross-modal alignment between visual and textual features has been regarded as an effective approach for this task, as it learns a domain-consistent latent feature space for the visual and semantic features. However, this approach may suffer from the heterogeneity between modalities, such as the mismatched distributions of the multimodal features and the differences in the learned feature values. To alleviate this problem, this paper first presents a comparative analysis of various alignment strategies and their impact on visual classification. Subsequently, a cross-modal inference and fusion framework (termed CRIF) is proposed to align the heterogeneous features in both feature distributions and feature values. More importantly, CRIF includes a cross-modal information enrichment module that improves the final classification and learns the mappings from the visual to the semantic space. We conduct experiments on four benchmark datasets, i.e., Vireo-Food172, NUS-WIDE, MSR-VTT, and ActivityNet Captions. We report state-of-the-art results for basic classification tasks on the four datasets and conduct further experiments on feature alignment and fusion. The experimental results verify that CRIF effectively improves the learning ability of visual classification models, and that it is a model-agnostic framework that consistently improves the performance of state-of-the-art visual classification models.
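To make the idea of aligning heterogeneous features concrete, the following is a minimal sketch, not the paper's actual CRIF implementation: it only illustrates the general notion of projecting visual and textual features into a shared latent space and penalizing both value-level differences (paired embeddings) and distribution-level differences (batch statistics). All module names, dimensions, and loss weights below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyCrossModalAligner(nn.Module):
    """Projects visual and textual features into one latent space (illustrative only)."""

    def __init__(self, vis_dim=2048, txt_dim=768, latent_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, latent_dim)
        self.txt_proj = nn.Linear(txt_dim, latent_dim)

    def forward(self, vis_feat, txt_feat):
        v = F.normalize(self.vis_proj(vis_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t


def alignment_losses(v, t):
    # Value-level alignment: pull paired visual/textual embeddings together.
    value_loss = F.mse_loss(v, t)
    # Distribution-level alignment (crude stand-in): match per-dimension
    # batch means and standard deviations of the two modalities.
    dist_loss = F.mse_loss(v.mean(dim=0), t.mean(dim=0)) + \
                F.mse_loss(v.std(dim=0), t.std(dim=0))
    return value_loss, dist_loss


if __name__ == "__main__":
    model = ToyCrossModalAligner()
    vis = torch.randn(8, 2048)   # e.g. image-encoder features (assumed dims)
    txt = torch.randn(8, 768)    # e.g. text-encoder features (assumed dims)
    v, t = model(vis, txt)
    value_loss, dist_loss = alignment_losses(v, t)
    loss = value_loss + 0.1 * dist_loss   # weighting chosen arbitrarily here
    loss.backward()
    print(float(value_loss), float(dist_loss))
```

In practice, a framework such as CRIF would combine losses of this kind with the downstream classification objective and a fusion/enrichment module; the sketch above only shows the alignment signal in isolation.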
