Abstract

In image-text multimodal classification, fusion is typically performed between text features and image features. Conventional fusion assumes that image features from multiple network layers are fused with text features jointly and synchronously. In practice, however, image features at different levels interact with text features in different ways, and these interactions interfere with one another under conventional fusion. Furthermore, because low-level image features are semantically weak, fusing them with text features is less effective than fusing high-level image features with text features. To address these problems, this paper proposes a framework of multi-level independent fusion between text features and image features at different levels. In this framework, the fusions between text features and multi-level image features are conducted asynchronously and independently of each other. Moreover, to improve fusion efficiency when text features are fused with low-level image features, our method enriches the low-level image features with semantic information through a Twin Pyramid (TP) module, which propagates semantic information top-down to them. Extensive experiments on the MIMIC-CXR dataset demonstrate that multi-level independent fusion effectively combines image and text features and outperforms traditional methods.
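As an illustration only, the following is a minimal PyTorch-style sketch of the idea the abstract describes. A generic FPN-like top-down pathway stands in for the Twin Pyramid module; the feature dimensions, the concatenation-based per-level fusion, and the averaging of per-level predictions are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelIndependentFusion(nn.Module):
    """Sketch of multi-level independent image-text fusion (not the authors' model).

    Assumptions: three image feature levels with channel widths `img_dims`,
    a pooled text embedding of width `text_dim`, concatenation as the
    per-level fusion, and an average of the per-level logits as the output.
    """

    def __init__(self, img_dims=(256, 512, 1024), text_dim=768,
                 pyramid_dim=256, num_classes=14):
        super().__init__()
        # 1x1 convolutions project every level to a common pyramid width
        # (an FPN-style lateral connection for the top-down pathway).
        self.lateral = nn.ModuleList(
            nn.Conv2d(d, pyramid_dim, kernel_size=1) for d in img_dims)
        # One independent fusion/classification head per level.
        self.heads = nn.ModuleList(
            nn.Linear(pyramid_dim + text_dim, num_classes) for _ in img_dims)

    def forward(self, img_feats, text_feat):
        # img_feats: list of maps, low level first, shapes (B, C_i, H_i, W_i).
        # text_feat: pooled text embedding of shape (B, text_dim).
        feats = [lat(f) for lat, f in zip(self.lateral, img_feats)]

        # Top-down pathway: upsample the semantically strong higher-level
        # map and add it to each lower level to enrich its semantics.
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(feats[i + 1], size=feats[i].shape[-2:],
                               mode="nearest")
            feats[i] = feats[i] + up

        # Independent (asynchronous) fusion: each level is pooled,
        # concatenated with the text feature, and classified separately,
        # so the levels do not interfere with one another.
        logits = []
        for feat, head in zip(feats, self.heads):
            pooled = feat.mean(dim=(-2, -1))  # global average pooling
            logits.append(head(torch.cat([pooled, text_feat], dim=-1)))

        # Combine the independent per-level predictions (simple average here).
        return torch.stack(logits).mean(dim=0)


# Example usage with random tensors standing in for backbone and text-encoder outputs.
model = MultiLevelIndependentFusion()
imgs = [torch.randn(2, 256, 56, 56),
        torch.randn(2, 512, 28, 28),
        torch.randn(2, 1024, 14, 14)]
text = torch.randn(2, 768)
print(model(imgs, text).shape)  # torch.Size([2, 14])
```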
