Abstract

High-quality labeled data plays a critical role in advancing machine learning, yet noise persists in widely used datasets and remains a challenge. While noise learning has gained traction in machine learning, particularly in computer vision, its exploration in text and multimodal classification has lagged behind. Moreover, a comprehensive comparison of noise learning techniques for text and multimodal classification has been lacking, partly because experimental noise settings vary across prior studies. To address these gaps, this research introduces a Multimodal Infusion Joint Training (MinJoT) framework with a novel co-regularized loss function that integrates multimodal information during joint training, enabling the model to remain robust in noisy data environments. Adapting established noise learning methods from computer vision to text classification, the study conducts extensive experiments on five English and Chinese textual and multimodal datasets, covering four methods, five noise modes, and seven noise rates. Critically, this work challenges the implicit assumption that widely used datasets are free of noise, revealing that they contain noise at levels ranging from 0.61% to 15.77%, which is defined as intrinsic noise in this study. For the first time, the study investigates the impact of intrinsic noise on model performance, categorizing it into distinct levels of ambiguity. To enable accurate comparison of methods, a new dataset, Golden-Chnsenticorp (G-Chnsenticorp), is introduced, carefully constructed to be free of intrinsic noise. This research establishes the first noise learning benchmark for text classification and the first noise learning framework tailored to multimodal sentiment classification, providing a novel framework, benchmarks, and insights to advance noise learning in text and multimodal contexts.
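The abstract names a co-regularized loss that fuses multimodal information during joint training but does not spell out its form. The following is a minimal sketch of one common style of co-regularization, assuming two modality branches (text and image) each trained with cross-entropy on the possibly noisy labels plus a symmetric-KL agreement term between their predictions; the function name, the branch names, and the `lambda_coreg` weight are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def co_regularized_loss(text_logits, image_logits, labels, lambda_coreg=1.0):
    """Illustrative joint-training loss for two modality branches.

    Combines per-branch cross-entropy with a symmetric KL agreement term
    between the branches' predictive distributions. Disagreement tends to be
    larger on noisily labeled examples, so penalizing it acts as a
    regularizer against fitting label noise.
    """
    # Supervised terms: each branch is trained on the (possibly noisy) labels.
    ce_text = F.cross_entropy(text_logits, labels)
    ce_image = F.cross_entropy(image_logits, labels)

    # Agreement term: symmetric KL divergence between the two branches'
    # predictive distributions (both passed as log-probabilities).
    log_p_text = F.log_softmax(text_logits, dim=-1)
    log_p_image = F.log_softmax(image_logits, dim=-1)
    kl_a = F.kl_div(log_p_text, log_p_image, log_target=True, reduction="batchmean")
    kl_b = F.kl_div(log_p_image, log_p_text, log_target=True, reduction="batchmean")
    agreement = 0.5 * (kl_a + kl_b)

    return ce_text + ce_image + lambda_coreg * agreement


# Example usage with random tensors standing in for model outputs.
if __name__ == "__main__":
    batch, num_classes = 8, 3
    text_logits = torch.randn(batch, num_classes, requires_grad=True)
    image_logits = torch.randn(batch, num_classes, requires_grad=True)
    labels = torch.randint(0, num_classes, (batch,))
    loss = co_regularized_loss(text_logits, image_logits, labels, lambda_coreg=0.5)
    loss.backward()
    print(loss.item())
```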
