Abstract

In image-text multimodal classification, fusion is typically performed between text features and image features. Conventional fusion assumes that image features from multiple network layers are fused with text features jointly and synchronously. In practice, however, image features at different levels interact with text features in different ways, and these interactions interfere with one another under conventional fusion. Furthermore, because low-level image features are semantically weak, fusing them with text features is less effective than fusing high-level image features with text features. To address these problems, this paper proposes a framework of multi-level independent fusion between text features and image features at different levels. In this framework, the fusions between text features and multi-level image features are conducted asynchronously and independently of each other. Moreover, to improve fusion efficiency when text features are fused with low-level image features, our method enriches the low-level image features with semantic information through a Twin Pyramid (TP) module, which propagates semantic information top-down to them. Extensive experiments on the MIMIC-CXR dataset demonstrate that multi-level independent fusion effectively combines image and text features and outperforms traditional methods.
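As an illustration only, the following is a minimal PyTorch-style sketch of the idea the abstract describes. A generic FPN-like top-down pathway stands in for the Twin Pyramid module; the feature dimensions, the concatenation-based per-level fusion, and the averaging of per-level predictions are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelIndependentFusion(nn.Module):
    """Sketch of multi-level independent image-text fusion (not the authors' model).

    Assumptions: three image feature levels with channel widths `img_dims`,
    a pooled text embedding of width `text_dim`, concatenation as the
    per-level fusion, and an average of the per-level logits as the output.
    """

    def __init__(self, img_dims=(256, 512, 1024), text_dim=768,
                 pyramid_dim=256, num_classes=14):
        super().__init__()
        # 1x1 convolutions project every level to a common pyramid width
        # (an FPN-style lateral connection for the top-down pathway).
        self.lateral = nn.ModuleList(
            nn.Conv2d(d, pyramid_dim, kernel_size=1) for d in img_dims)
        # One independent fusion/classification head per level.
        self.heads = nn.ModuleList(
            nn.Linear(pyramid_dim + text_dim, num_classes) for _ in img_dims)

    def forward(self, img_feats, text_feat):
        # img_feats: list of maps, low level first, shapes (B, C_i, H_i, W_i).
        # text_feat: pooled text embedding of shape (B, text_dim).
        feats = [lat(f) for lat, f in zip(self.lateral, img_feats)]

        # Top-down pathway: upsample the semantically strong higher-level
        # map and add it to each lower level to enrich its semantics.
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(feats[i + 1], size=feats[i].shape[-2:],
                               mode="nearest")
            feats[i] = feats[i] + up

        # Independent (asynchronous) fusion: each level is pooled,
        # concatenated with the text feature, and classified separately,
        # so the levels do not interfere with one another.
        logits = []
        for feat, head in zip(feats, self.heads):
            pooled = feat.mean(dim=(-2, -1))  # global average pooling
            logits.append(head(torch.cat([pooled, text_feat], dim=-1)))

        # Combine the independent per-level predictions (simple average here).
        return torch.stack(logits).mean(dim=0)


# Example usage with random tensors standing in for backbone and text-encoder outputs.
model = MultiLevelIndependentFusion()
imgs = [torch.randn(2, 256, 56, 56),
        torch.randn(2, 512, 28, 28),
        torch.randn(2, 1024, 14, 14)]
text = torch.randn(2, 768)
print(model(imgs, text).shape)  # torch.Size([2, 14])
```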
