Abstract

Language is an integral part of human interpersonal communication and is conveyed through multiple sensory channels. This multisensory communication skill has motivated numerous studies on multimodal information processing that aim to develop systems mimicking this natural behaviour: automatic speech recognition (ASR) models the act of listening, text-to-speech (TTS) models speaking, and various image processing models represent visual perception. Most of these models are trained and tuned independently on parallel examples mapping the source modality to the target modality. In real-life situations, however, much of the data is unpaired. Inspired by the self-supervision inherent in the human auditory and visual perception system, we propose a multimodal chain mechanism with a weakly supervised chain training strategy in which the component models are trained and tuned jointly. In the proposed framework, when the amount of paired training data is insufficient, collaboration among ASR, TTS, image captioning (IC), and image production models can improve their performance through single- or dual-loop chain mechanisms. Our experimental results show that such a closed-loop chain mechanism can improve a model with both unpaired and unrelated data from different modalities in a semi-supervised manner. Through the collaboration of the speech and visual chains, we improve ASR performance using an image-only dataset while maintaining the performance of the other models.
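
To make the closed-loop idea concrete, the sketch below illustrates one semi-supervised speech-chain update on unpaired audio: ASR produces a pseudo-transcript without gradients, TTS reconstructs the audio features from it, and the reconstruction loss updates TTS (a symmetric text-only loop would update ASR, and the visual chain pairs IC with image production in the same way). This is only a minimal illustration; `ToyASR`, `ToyTTS`, the feature dimensions, and the L1 reconstruction loss are our own simplifying assumptions, not the architectures or objectives used in the paper.

```python
"""Minimal sketch of one closed-loop speech-chain update on unpaired audio.
ToyASR and ToyTTS are hypothetical stand-ins (simple linear maps), not the
paper's actual models; the loop structure is the point of the example."""
import torch
import torch.nn as nn

FEAT_DIM, TXT_DIM = 80, 32          # assumed audio-feature / text-embedding sizes


class ToyASR(nn.Module):
    """Stand-in ASR: audio features -> soft text embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT_DIM, TXT_DIM)

    def forward(self, audio_feats):
        return self.net(audio_feats)


class ToyTTS(nn.Module):
    """Stand-in TTS: text embedding -> reconstructed audio features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(TXT_DIM, FEAT_DIM)

    def forward(self, text_emb):
        return self.net(text_emb)


def speech_chain_step(asr, tts, unpaired_audio, optimizer):
    """One semi-supervised update from audio-only data:
    ASR transcribes without gradients, TTS reconstructs the audio from the
    pseudo-transcript, and the reconstruction loss updates TTS."""
    with torch.no_grad():
        pseudo_text = asr(unpaired_audio)        # pseudo-label, gradient blocked
    reconstructed = tts(pseudo_text)             # differentiable reconstruction
    loss = nn.functional.l1_loss(reconstructed, unpaired_audio)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    asr, tts = ToyASR(), ToyTTS()
    opt = torch.optim.Adam(tts.parameters(), lr=1e-3)
    audio = torch.randn(16, FEAT_DIM)            # a batch of unpaired audio features
    print(speech_chain_step(asr, tts, audio, opt))
```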
