Abstract

Recent advances in deep learning have enabled realistic image-to-image translation of multimodal data. Alongside this development, auto-encoders and generative adversarial networks (GANs) have been extended to handle multimodal input and output. At the same time, multitask learning has been shown to efficiently and effectively address multiple mutually related recognition tasks. Various scene understanding tasks, such as semantic segmentation and depth prediction, can be viewed as cross-modal encoding/decoding, and hence most prior work used multimodal (various types of input) datasets for multitask (various types of output) learning. Inter-modal commonalities, such as those across RGB images, depth, and semantic labels, are beginning to be exploited, although this line of study is still at an early stage. In this chapter, we introduce several state-of-the-art encoder–decoder methods for multimodal learning as well as a new approach to cross-modal networks. In particular, we detail a multimodal encoder–decoder network that harnesses the multimodal nature of multitask scene recognition. In addition to the shared latent representation among encoder–decoder pairs, the model also has shared skip connections from different encoders. By combining these two representation-sharing mechanisms, the model is shown to efficiently learn a shared feature representation among all modalities in the training data.
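
To make the two sharing mechanisms concrete, below is a minimal PyTorch-style sketch, not the chapter's actual implementation; the module names, layer widths, and the set of modalities (RGB, depth, semantic labels) are illustrative assumptions. Each modality has its own encoder and decoder, every encoder maps into a common latent space and also emits intermediate skip features, and any decoder can consume the latent code and skip features produced by any encoder, so the skip connections are shared across modalities.

    # Minimal sketch of a multimodal encoder-decoder with a shared latent
    # space and shared skip connections. Layer sizes are illustrative.
    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Per-modality encoder returning a latent code and skip features."""
        def __init__(self, in_ch):
            super().__init__()
            self.conv1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU())
            self.conv2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
            self.conv3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())

        def forward(self, x):
            s1 = self.conv1(x)   # skip feature at 1/2 resolution
            s2 = self.conv2(s1)  # skip feature at 1/4 resolution
            z = self.conv3(s2)   # latent code in the shared space
            return z, (s1, s2)

    class Decoder(nn.Module):
        """Per-modality decoder consuming the shared latent code and skips."""
        def __init__(self, out_ch):
            super().__init__()
            self.up1 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU())
            self.up2 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 32, 4, stride=2, padding=1), nn.ReLU())
            self.up3 = nn.ConvTranspose2d(32 + 32, out_ch, 4, stride=2, padding=1)

        def forward(self, z, skips):
            s1, s2 = skips
            h = self.up1(z)
            h = self.up2(torch.cat([h, s2], dim=1))     # skip shared from the source encoder
            return self.up3(torch.cat([h, s1], dim=1))  # skip shared from the source encoder

    class MultimodalEncoderDecoder(nn.Module):
        """One encoder and one decoder per modality; any encoder's latent code
        and skip features can be routed to any decoder (cross-modal translation)."""
        def __init__(self, channels):
            super().__init__()
            self.encoders = nn.ModuleDict({m: Encoder(c) for m, c in channels.items()})
            self.decoders = nn.ModuleDict({m: Decoder(c) for m, c in channels.items()})

        def forward(self, x, src, dst):
            z, skips = self.encoders[src](x)     # encode the source modality
            return self.decoders[dst](z, skips)  # decode into the target modality

    # Usage: translate an RGB image into a depth map.
    model = MultimodalEncoderDecoder({"rgb": 3, "depth": 1, "semantic": 21})
    rgb = torch.randn(1, 3, 64, 64)
    depth = model(rgb, src="rgb", dst="depth")  # shape (1, 1, 64, 64)

Training such a model on all source/target modality pairs present in the data encourages a single feature representation that all encoders and decoders can share, which is the intuition behind combining the shared latent space with shared skip connections.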
