Owing to rapid progress in transformer-based models and multi-modal fusion strategies, the performance of multi-modal networks has improved remarkably in recent years. However, the intrinsic connections among individual modalities are crucial to multi-modal fusion, and exploiting them more fully presents substantial opportunities for enhancing multi-modal perception. Taking inspiration from the biological mechanisms of the human multi-modal cognitive system, we propose a novel cross-modality fusion network. Inspired by the lateral occipital complex, a region of the human visual system responsible for object-shape perception, the network integrates an auxiliary modality derived from a conventional multi-stream model. By learning complementary information across the image and language modalities, and by updating the parameters of this intermediate modality so that it becomes better suited for subsequent fusion, the integration mirrors the way humans process higher-order visual percepts. To evaluate the effectiveness of our approach, we conducted experiments under various settings. Statistical analyses demonstrate a considerable improvement in performance on vision-language tasks, supporting the biological plausibility of our model.
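To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of an auxiliary intermediate modality that attends to both an image stream and a language stream and is itself updated by gradient descent before the final fusion step. All module names, token counts, and dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AuxiliaryModalityFusion(nn.Module):
    """Hypothetical intermediate-modality fusion block (illustrative only)."""

    def __init__(self, dim=256, num_aux_tokens=8, num_heads=4):
        super().__init__()
        # Learnable intermediate-modality tokens, optimized jointly with the network.
        self.aux_tokens = nn.Parameter(torch.randn(1, num_aux_tokens, dim) * 0.02)
        # Cross-attention: the auxiliary tokens query each unimodal stream.
        self.attn_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, dim) features from an image encoder
        # txt_feats: (B, N_txt, dim) features from a language encoder
        aux = self.aux_tokens.expand(img_feats.size(0), -1, -1)
        # The auxiliary modality gathers complementary information from both streams.
        aux_img, _ = self.attn_img(aux, img_feats, img_feats)
        aux_txt, _ = self.attn_txt(aux, txt_feats, txt_feats)
        # Fused representation passed on to downstream vision-language heads.
        return self.fuse(aux_img + aux_txt)


# Example usage with random tensors standing in for encoder outputs.
fusion = AuxiliaryModalityFusion()
out = fusion(torch.randn(2, 49, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 8, 256])
```

The design choice illustrated here is that the intermediate modality is a set of trainable tokens whose parameters are refined during training, so that the information they carry is shaped specifically for the later fusion stage rather than being fixed by either unimodal encoder.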