Abstract

In multimodal tasks, the relative importance of text and image information often varies across input cases. To model these differences in importance, we propose a high-performance and highly general Dual-Router Dynamic Framework (DRDF), consisting of a Dual-Router, an MWF-Layer, experts, and an expert fusion unit. The text router and image router in the Dual-Router take text modal information and image modal information respectively, and the MWF-Layer is responsible for judging the importance of each modality. Based on this judgment, the MWF-Layer generates fused weights for the subsequent expert fusion. The experts can adopt a variety of backbones that match the current multimodal or unimodal task. DRDF features high generality and modularity: we test 12 backbones, such as Visual BERT, and their corresponding DRDF instances on the multimodal dataset Hateful Memes and the unimodal datasets CIFAR10, CIFAR100, and TinyImagenet. Our DRDF instances outperform those backbones. We also validate the effectiveness of DRDF's components through ablation studies, and discuss the reasoning and ideas behind the design of DRDF.
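The abstract only describes DRDF at a high level, so the following is a minimal sketch of how the described forward pass could be wired together. Every concrete choice here is an assumption: the linear routers, the sum-plus-softmax inside the MWF-Layer, the linear experts, and all names (`DRDFSketch`, `text_router`, `image_router`) and dimensions are hypothetical stand-ins, not the paper's actual design.

```python
# Hypothetical sketch of a DRDF-style forward pass: two modality routers,
# an MWF-Layer that fuses their signals into expert weights, and a
# weighted fusion of expert outputs. Not the paper's implementation.
import torch
import torch.nn as nn


class DRDFSketch(nn.Module):
    def __init__(self, text_dim: int, image_dim: int,
                 hidden_dim: int, num_experts: int):
        super().__init__()
        # Dual-Router: one (assumed linear) router per modality.
        self.text_router = nn.Linear(text_dim, num_experts)
        self.image_router = nn.Linear(image_dim, num_experts)
        # Experts: placeholder linear layers; in the paper each expert
        # would be a backbone suited to the task (e.g. Visual BERT).
        self.experts = nn.ModuleList(
            nn.Linear(text_dim + image_dim, hidden_dim)
            for _ in range(num_experts)
        )

    def forward(self, text_feat: torch.Tensor,
                image_feat: torch.Tensor) -> torch.Tensor:
        # Route each modality independently.
        text_logits = self.text_router(text_feat)      # (B, num_experts)
        image_logits = self.image_router(image_feat)   # (B, num_experts)
        # MWF-Layer stand-in: judge modality importance and emit fused
        # weights (assumed here to be a simple sum followed by softmax).
        fused_weights = torch.softmax(text_logits + image_logits, dim=-1)
        # Run every expert on the joint features.
        joint = torch.cat([text_feat, image_feat], dim=-1)
        expert_outs = torch.stack(
            [expert(joint) for expert in self.experts], dim=1
        )                                              # (B, E, hidden_dim)
        # Expert fusion unit: weighted sum of expert outputs.
        return (fused_weights.unsqueeze(-1) * expert_outs).sum(dim=1)


# Usage: random tensors stand in for text/image encoder outputs.
model = DRDFSketch(text_dim=768, image_dim=2048,
                   hidden_dim=512, num_experts=4)
out = model(torch.randn(2, 768), torch.randn(2, 2048))
print(out.shape)  # torch.Size([2, 512])
```

The key design point this sketch illustrates is that the fused weights are computed per input, so examples where text dominates and examples where the image dominates can weight the experts differently.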
