Abstract

Following significant breakthroughs in single-modal deep learning tasks, a growing body of work has turned to multi-modal tasks. Multi-modal tasks involve more than one modality, where a modality represents a type of behavior or state. Common modalities include vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in daily human life, and many typical multi-modal tasks focus on these two, such as visual captioning and visual grounding. In this paper, we conduct in-depth research on typical vision and language tasks from the perspectives of generation, analysis, and reasoning. First, we analyze and summarize the typical tasks and some classical methods, organized by their algorithmic concerns, and discuss frequently used datasets and metrics. Then, we briefly summarize other variant and cutting-edge tasks to build a more comprehensive framework of vision and language multi-modal tasks. Finally, we discuss the development of pre-training research and give an outlook on future work. We hope this survey helps researchers understand the latest progress, existing problems, and directions for exploration in vision and language multi-modal tasks, and that it provides guidance for future research.

