Abstract

Owing to rapid progress in transformer-based models and multi-modal fusion strategies, the performance of multi-modal networks has improved remarkably in recent years. However, the intrinsic connections among individual modalities remain crucial to multi-modal fusion, leaving substantial room for further improvement in multi-modal perception. Taking inspiration from the biological mechanisms of the human multi-modal cognitive system, we propose a novel cross-modality fusion network. The network draws on the lateral occipital complex, the region responsible for object-shape perception in human vision, and integrates an auxiliary modality derived from a conventional multi-stream model. By learning complementary information across the image and language modalities, this integration captures the biological mechanisms by which humans process higher-order visual percepts: the parameters of the intermediate modality are updated to make it better suited for subsequent fusion. To evaluate the effectiveness of our approach, we conducted experiments under various settings. Our statistical analyses demonstrate a considerable improvement on vision-language tasks, thereby validating the biological plausibility of our model.
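
The abstract describes the fusion mechanism only at a high level; the sketch below is one possible reading, in which a set of learnable auxiliary tokens plays the role of the intermediate modality and gathers complementary information from the image and language streams via cross-attention. The module name `AuxiliaryModalityFusion`, the dimensions, and the use of multi-head cross-attention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of fusion through a learnable auxiliary (intermediate) modality.
# All names, shapes, and the cross-attention design are assumptions for illustration.
import torch
import torch.nn as nn


class AuxiliaryModalityFusion(nn.Module):
    """Fuses image and language features through learnable auxiliary tokens.

    The auxiliary tokens stand in for the intermediate modality described in the
    abstract: they are updated by back-propagation so that they carry complementary
    information from both streams before the final fusion step.
    """

    def __init__(self, dim: int = 256, num_aux_tokens: int = 8, num_heads: int = 4):
        super().__init__()
        # Learnable intermediate ("auxiliary") modality tokens.
        self.aux_tokens = nn.Parameter(torch.randn(1, num_aux_tokens, dim) * 0.02)
        # Cross-attention: auxiliary tokens query each modality in turn.
        self.attn_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, N_img, dim), txt_feats: (B, N_txt, dim)
        b = img_feats.size(0)
        aux = self.aux_tokens.expand(b, -1, -1)
        # The auxiliary modality gathers complementary cues from each stream.
        aux = aux + self.attn_img(aux, img_feats, img_feats)[0]
        aux = aux + self.attn_txt(aux, txt_feats, txt_feats)[0]
        aux = self.norm(aux)
        # Pool the fused auxiliary tokens into a single multi-modal representation.
        return self.head(aux.mean(dim=1))


if __name__ == "__main__":
    fusion = AuxiliaryModalityFusion()
    img = torch.randn(2, 49, 256)   # e.g. a 7x7 grid of visual features
    txt = torch.randn(2, 16, 256)   # e.g. 16 token embeddings
    print(fusion(img, txt).shape)   # torch.Size([2, 256])
```

Because the auxiliary tokens are ordinary parameters, any task loss applied to the fused output updates them jointly with the encoders, which is one way to realize the abstract's notion of optimizing the intermediate modality for further fusion.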
