Abstract

Two of the communication channels that convey the most information in human-to-human interaction are the face and speech. A robust interpretation of the information a person expresses can be obtained by jointly analyzing both sources: the short-term evolution of facial features (face) and the speech signal. Combined face and speech analysis is therefore the basis of a large number of human-computer interfaces and services. Regardless of the final application of such interfaces, two aspects are commonly required: detection of human faces and fusion of the two sources of information. In the first section of the chapter, we review the state of the art in face and facial feature detection. The methods are analyzed according to the models they use to represent images and patterns: pixel-based, block-based, transform-coefficient-based, and region-based techniques. In the second section, we present two examples of multimodal signal processing applications. The first localizes the speaker's mouth in a video sequence, using both the audio signal and the motion extracted from the video. The second recognizes the spoken words in a video sequence using both the audio and the images of the moving lips.
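
To make the detection survey concrete, here is a minimal sketch of a block/feature-based face detector of the kind such reviews cover, using OpenCV's pretrained Viola-Jones Haar cascade. This is not the chapter's own method; the input filename frame.png and the scaleFactor/minNeighbors settings are illustrative assumptions.

```python
import cv2

# Pretrained frontal-face Haar cascade shipped with the opencv-python package.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("frame.png")  # hypothetical input frame
if img is None:
    raise FileNotFoundError("frame.png not found")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Multi-scale sliding-window detection; parameters are illustrative defaults.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.png", img)
```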
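For the first application, the sketch below illustrates one plausible way to localize a speaker's mouth from audio and motion: correlate the motion energy of each image block with the frame-rate audio energy and pick the block that best tracks the audio. The block size, the frame-difference motion measure, and the Pearson-style correlation score are assumptions for illustration, not the chapter's actual algorithm.

```python
import numpy as np

def localize_mouth(frames, audio_energy, block=16):
    """Return the (row, col) of the image block whose motion energy
    correlates best with the audio energy.

    frames: (T, H, W) grayscale video; audio_energy: (T,) one energy
    value per video frame. Block size and score are assumptions.
    """
    T, H, W = frames.shape
    # Frame-difference magnitude as a crude motion measure: (T-1, H, W).
    motion = np.abs(np.diff(frames.astype(float), axis=0))
    a = audio_energy[1:] - audio_energy[1:].mean()  # align with motion frames
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(0, H - block + 1, block):
        for c in range(0, W - block + 1, block):
            m = motion[:, r:r + block, c:c + block].sum(axis=(1, 2))
            m -= m.mean()
            denom = np.linalg.norm(a) * np.linalg.norm(m)
            score = (a @ m) / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos
```

On a talking-head sequence, the winning block tends to sit over the mouth, since lip motion co-varies with speech energy while background motion does not.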
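For the second application, the two streams must be combined before recognition; one common scheme is early (feature-level) fusion. The sketch below is a hedged illustration of that scheme, where the feature names (mfcc, lip_feats) and per-stream normalization are assumptions rather than the chapter's specific recipe; the fused vectors would then feed a sequence classifier such as an HMM.

```python
import numpy as np

def fuse_features(mfcc, lip_feats):
    """Concatenate acoustic and visual features at a common frame rate.

    mfcc: (T, Da) acoustic features; lip_feats: (T, Dv) visual lip
    features, assumed already resampled to the same T. Illustrative only.
    """
    # Per-stream z-normalization so neither modality dominates the classifier.
    a = (mfcc - mfcc.mean(axis=0)) / (mfcc.std(axis=0) + 1e-8)
    v = (lip_feats - lip_feats.mean(axis=0)) / (lip_feats.std(axis=0) + 1e-8)
    return np.concatenate([a, v], axis=1)  # (T, Da + Dv)
```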
