Abstract

Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic and visual messages. With the initial research on audio-visual speech recognition and, more recently, image and video captioning projects, this research field brings some unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities. This tutorial builds upon a recent course taught at Carnegie Mellon University during the Spring 2016 semester (CMU course 11-777) and two tutorials presented at CVPR 2016 and ICMI 2016. The present tutorial will review fundamental concepts of machine learning and deep neural networks before describing the five main challenges in multimodal machine learning: (1) multimodal representation learning, (2) translation & mapping, (3) modality alignment, (4) multimodal fusion and (5) co-learning. The tutorial will also present state-of-the-art algorithms recently proposed for multimodal applications such as image captioning, video description and visual question answering. We will also discuss the current and upcoming challenges.

Highlights

  • Representation: A first fundamental challenge is to learn how to represent and summarize the multimodal data to highlight the complementarity and synchrony between modalities

  • Multimodal machine learning is a vibrant multi-disciplinary research field which addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic and visual messages

  • With the initial research on audio-visual speech recognition and more recently with language & vision projects such as image and video captioning and visual question answering, this research field brings some unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities


Summary

Introduction

Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic and visual messages. With the initial research on audio-visual speech recognition and, more recently, language & vision projects such as image and video captioning and visual question answering, this research field brings some unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities. The present tutorial will review fundamental concepts of machine learning and deep neural networks before describing the five main challenges in multimodal machine learning: (1) representation, the fundamental challenge of learning how to represent and summarize multimodal data to highlight the complementarity and synchrony between modalities; (2) translation & mapping; (3) alignment; (4) fusion; and (5) co-learning.
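
To make the representation challenge concrete, the following is a minimal sketch (not taken from the tutorial) of the simplest joint-representation baseline: features from each modality are projected into a common space and concatenated. The feature names, dimensionalities, and the random projection standing in for learned layers are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical pre-extracted per-sample features (names and sizes are illustrative).
    text_feat  = rng.normal(size=300)    # e.g. averaged word embeddings
    image_feat = rng.normal(size=2048)   # e.g. a CNN pooling-layer activation
    audio_feat = rng.normal(size=40)     # e.g. MFCC statistics

    def project(x, out_dim, rng):
        """Random linear projection standing in for a learned layer."""
        w = rng.normal(size=(out_dim, x.shape[0])) / np.sqrt(x.shape[0])
        return np.tanh(w @ x)

    # Joint representation: map every modality into the same space, then concatenate.
    joint = np.concatenate([project(f, 128, rng)
                            for f in (text_feat, image_feat, audio_feat)])
    print(joint.shape)   # (384,)

In practice the projections would be learned jointly (e.g. with a multimodal autoencoder or a supervised network), and the concatenation step is where alternatives such as tensor fusion or attention-based fusion would differ.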

