Abstract

Advances in multimodal machine learning help artificial intelligence more closely resemble human intellect, which perceives the world through multiple modalities. We surveyed state-of-the-art research on bidirectional machine learning translation between images and natural language processing (NLP), two modalities that cover a considerable proportion of human life. Recently, with advances in deep learning model architectures and learning methods in the image and NLP fields, considerable progress has been made in multimodal machine learning translation built by integrating the two. Our goal is to explore and summarize state-of-the-art research on multimodal machine learning translation and to present a taxonomy for bidirectional multimodal machine learning translation between images and NLP. Furthermore, we review the evaluation metrics and compare the state-of-the-art approaches that influence this field. We believe this survey will become a cornerstone for future research by discussing the challenges in multimodal machine learning translation and directions for future work, grounded in an understanding of the state of the art in the field.
