The natural language inference (NLI) task requires an agent to determine the semantic relation between a premise sentence (p) and a hypothesis sentence (h), which demands a deep understanding of sentence semantics. Because of polysemy, ambiguity, and the inherent fuzziness of natural sentences, such fine-grained sentence understanding is very challenging. To this end, in this article we introduce the image corresponding to a sentence as reference information to enhance sentence semantic understanding and representation. Specifically, we first propose an image-enhanced multilevel sentence representation net (IEMLRN) that exploits image features from pretrained models to enhance sentence semantics at different scales, i.e., the lexical, phrase, and sentence levels. The proposed model improves performance on NLI tasks by leveraging pretrained global image features. However, because these pretrained features are optimized on specific image classification datasets, they may not yield the best performance on NLI tasks. We therefore design an adaptive image feature generator that extracts fine-grained image labels from the corresponding sentences, and extend IEMLRN to a multilevel image-enhanced sentence representation net (MIESR) that exploits not only the coarse-grained pretrained image features but also the fine-grained adaptive image features. In this way, sentence semantics can be evaluated and enhanced more comprehensively and precisely. Extensive experiments on two benchmark datasets (SNLI and SICK) show that the proposed IEMLRN significantly outperforms state-of-the-art baselines, and that the proposed MIESR model achieves the best performance by considering not only the text but also images at adaptive multiple granularities.
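To make the multilevel fusion idea concrete, the following is a minimal illustrative sketch (not the authors' code): a global image feature vector gates text representations at the word, phrase, and sentence levels. All shapes, the sigmoid-gated fusion form, and the bigram phrase construction are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(text_repr, img_feat, W):
    """Gate text features by a projected image feature (hypothetical fusion)."""
    gate = 1.0 / (1.0 + np.exp(-(img_feat @ W)))   # sigmoid gate, shape (d_text,)
    return text_repr * gate                        # broadcasts over tokens

d_text, d_img, n_tokens = 8, 16, 5
W = rng.normal(size=(d_img, d_text))               # image-to-text projection
img = rng.normal(size=(d_img,))                    # pretrained global image feature

words = rng.normal(size=(n_tokens, d_text))        # lexical-level representations
phrases = words[:-1] + words[1:]                   # phrase-level (bigram sums)
sentence = words.mean(axis=0, keepdims=True)       # sentence-level representation

# Image-enhanced representations at all three granularities.
multi = [fuse(x, img, W) for x in (words, phrases, sentence)]
print([m.shape for m in multi])
```

In the actual model, these fused representations at each granularity would feed a matching layer that compares premise and hypothesis; here the sketch only shows how one image feature can condition all three levels.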