Far-Field Multimodal Speech Processing and Conversational Interaction in Smart Spaces

Gerasimos Potamianos, Mark Epstein, Lubos Ures, Matthew Black, Etienne Marcheret, Martin Labsky, Vit Libal, Patrick Lucey, Rajesh Balchandran, Ladislav Seredi, Jing Huang

doi:10.1109/hscma.2008.4538701

Abstract

Robust speech processing is a crucial component in the development of usable and natural conversational interfaces. In this paper, we are particularly interested in human-computer interaction taking place in "smart" spaces, that is, spaces equipped with a number of far-field, unobtrusive microphones and camera sensors. The availability of these sensors allows multi-sensory and multi-modal processing, thus improving the robustness of speech-based perception technologies in a number of scenarios of interest, for example lectures and meetings held inside smart conference rooms, or interaction with domotic devices in smart homes. We overview recent work at IBM Research on developing state-of-the-art speech technology for smart spaces. In particular, we discuss acoustic scene analysis, speech activity detection, speaker diarization, and speech recognition, emphasizing multi-sensory or multi-modal processing. The resulting technology is envisaged to allow far-field conversational interaction in smart spaces based on dialog management and natural language understanding of user requests.

Full Text