The field of multimedia research has witnessed significant interest in leveraging multimodal pretrained neural network models to perceive and represent the physical world. Among these models, vision-language pretraining (VLP) has emerged as a compelling topic. Currently, the prevalent approach in VLP supervises training with paired image-text data. However, little effort has been devoted to examining whether essential linguistic knowledge, such as semantics and syntax, is extracted during VLP and how it affects multimodal alignment. In response, our study aims to shed light on the influence of comprehensive linguistic knowledge, encompassing semantic expression and syntactic structure, on multimodal alignment. To this end, we introduce SNARE, a large-scale multimodal alignment probing benchmark designed to detect vital linguistic components, including lexical, semantic, and syntactic knowledge. SNARE offers four distinct tasks: Semantic Structure, Negation Logic, Attribute Ownership, and Relationship Composition. Leveraging SNARE, we conduct holistic analyses of six advanced VLP models (BLIP, CLIP, Flava, X-VLM, BLIP2, and GPT-4), along with human performance, and reveal key characteristics of VLP models: (i) insensitivity to complex syntactic structures, relying primarily on content words for sentence comprehension; (ii) limited comprehension of sentence combinations and negations; (iii) difficulty determining actions or spatial relations within visual information, as well as verifying the correctness of ternary relationships. Based on these findings, we propose the following strategies to enhance multimodal alignment in VLP: (1) use a large generative language model as the language backbone in VLP to facilitate the understanding of complex sentences; (2) build high-quality datasets that emphasize content words and employ simple syntax, such as short-distance semantic composition, to improve multimodal alignment; (3) incorporate more fine-grained visual knowledge, such as spatial relationships, into the pretraining objectives.
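As an illustration of the kind of probing the abstract describes (not the authors' evaluation protocol), the minimal sketch below compares an off-the-shelf VLP model's image-text matching scores for an original caption versus a negated variant, in the spirit of the Negation Logic task. It uses the Hugging Face CLIP API; the image path and caption pair are hypothetical placeholders.

```python
# Minimal probing sketch (assumption: Hugging Face transformers CLIP API;
# "example.jpg" and the captions are hypothetical, not from the SNARE benchmark).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical probe image
captions = [
    "a dog is sitting on the sofa",      # original caption
    "a dog is not sitting on the sofa",  # negated variant (Negation Logic-style probe)
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; a model sensitive to
# negation should score the original caption higher than its negated counterpart.
scores = outputs.logits_per_image.squeeze(0)
print({caption: score.item() for caption, score in zip(captions, scores)})
```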