Abstract

The word “multimodal” is used by researchers in different fields, often with different meanings. One of its most common uses is in the field of human–computer interaction (HCI), where a modality is a natural mode of interaction: speech, vision, facial expressions, handwriting, gestures, or even head and body movements. Combining several such modalities leads to multimodal speaker tracking systems, multimodal person identification systems, multimodal speech recognizers or, more generally, multimodal interfaces. Such interfaces aim to facilitate HCI by augmenting or even replacing the traditional keyboard and mouse. Multimodal speaker detection, tracking, or localization consists of identifying the active speaker in an audio-video sequence containing several speakers, based on the correlation between the audio and the movement in the video. In multimodal speech recognition, or audio-visual speech recognition, visual information from the speakers' lips augments the audio stream to improve speech recognition accuracy. A multimodal biometric system establishes the identity of a person from a list of candidates previously enrolled in the system, based on not one but several modalities, drawn from a long list: face images, audio speech, visual speech or lip-reading, fingerprints, iris images, retinal images, handwriting, and gait.
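
As a rough illustration of the correlation idea behind multimodal speaker detection, the sketch below scores each visible speaker by the Pearson correlation between frame-level audio energy and motion energy measured in that speaker's mouth region, and picks the best-correlated speaker as the active one. The function names, the per-frame energy features, and the choice of Pearson correlation are illustrative assumptions, not the method of the paper.

```python
import numpy as np

def av_correlation(audio_energy, motion_energy):
    # Pearson correlation between frame-level audio energy and
    # motion energy in one speaker's mouth region (both 1-D arrays
    # of equal length, one value per video frame).
    a = audio_energy - audio_energy.mean()
    v = motion_energy - motion_energy.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(v)
    return float(a @ v) / denom if denom > 0 else 0.0

def detect_active_speaker(audio_energy, motion_per_speaker):
    # Score every visible speaker and return the index of the one
    # whose motion correlates best with the audio track.
    scores = [av_correlation(audio_energy, m) for m in motion_per_speaker]
    return int(np.argmax(scores)), scores

# Toy usage: speaker 1's motion follows the audio envelope, so it wins.
rng = np.random.default_rng(0)
audio = rng.random(100)
motions = [rng.random(100),                # silent speaker: random motion
           audio + 0.1 * rng.random(100)]  # active speaker: tracks audio
idx, scores = detect_active_speaker(audio, motions)
print(idx, scores)  # idx == 1, with a much higher score for speaker 1
```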

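The abstract's notion of establishing identity from several modalities is often realized by score-level fusion. The minimal sketch below combines per-modality match scores for each enrolled candidate with a weighted sum; the function name, the equal-weight default, and the assumption that each modality's scores are already normalized to [0, 1] are illustrative choices, not details taken from the paper.

```python
import numpy as np

def fuse_and_identify(modality_scores, weights=None):
    # modality_scores: (num_modalities, num_candidates) array of match
    # scores, each row assumed normalized to [0, 1] beforehand.
    s = np.asarray(modality_scores, dtype=float)
    if weights is None:
        weights = np.full(s.shape[0], 1.0 / s.shape[0])  # equal weights
    fused = np.asarray(weights) @ s  # weighted-sum (score-level) fusion
    return int(np.argmax(fused)), fused

# Toy usage: face and voice match scores for three enrolled candidates.
face  = [0.90, 0.40, 0.30]
voice = [0.20, 0.80, 0.85]
who, fused = fuse_and_identify([face, voice])
print(who, fused)  # who == 1: fused scores are [0.55, 0.60, 0.575]
```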