AUDITORY/VISUAL SPEECH IN MULTIMODAL HUMAN INTERFACES

Dominic W. Massaro and Michael M. Cohen
{massaro,mmcohen}@fuzzy.ucsc.edu
Program in Experimental Psychology
University of California
Santa Cruz, CA 95064

ABSTRACT

It has long been a hope, expectation, and prediction that speech would be the primary medium of communication between humans and machines. To date, this dream has not been realized. We predict that exploiting the multimodal nature of spoken language will facilitate the use of this medium. We begin our paper with a general framework for the analysis of speech recognition by humans and a theoretical model. We then present a system for auditory/visual speech synthesis that performs complete text-to-speech synthesis. This system should improve the quality as well as the attractiveness of speech as one of a machine’s primary output communication media. Mirroring the value of multimodal speech synthesis, multimodal channels should also enhance speech recognition by machine.

1. INTRODUCTION

Speech perception is a human skill that rivals our other impressive achievements. Even after decades of intense effort, speech recognition by machine remains far inferior to human performance. Our thesis is that 1) there are multiple sources of information supporting speech perception, 2) the perceiver evaluates each source in parallel with all of the others, and 3) all of these sources are combined or integrated to achieve perceptual recognition. Recognition of a word in a sentence is achieved via a variety of bottom-up and top-down sources of information. Top-down sources include contextual, semantic, syntactic, and phonological constraints; bottom-up sources include audible and visible features of the spoken word.

Our research is carried out within the framework of a fuzzy logical model of perception (FLMP) in which speech perception is viewed as having available multiple sources of information supporting the identification and interpretation of the language input. The assumptions central to the model are 1) each source of information is evaluated to give the degree to which that source specifies various alternatives, 2) the sources of information are evaluated independently of one another, 3) the sources are integrated to provide an overall degree of support for each alternative, and 4) perceptual identification and interpretation follow the relative degree of support among the alternatives.

This research paradigm permits us to determine which of the many potentially functional cues are actually used [1]. The systematic variation of properties of the speech signal, combined with the quantitative test of models based on different sources of information, enables the investigator to test the psychological validity of different cues. This paradigm has already proven to be effective in the study of audible, visible, and bimodal speech perception [1,2]. Thus, our research strategy not only addresses how different sources of information are evaluated and integrated, it can also uncover what sources of information are actually used. We believe that the research paradigm confronts both the important psychophysical question of the nature of information and the process question of how the information is transformed and mapped into behavior.

2. SPEECH BY EYE AS WELL AS BY EAR

There is valuable and effective information afforded by a view of the speaker’s face in speech perception and recognition by humans.
Visible speech is particularly effective when the auditory speech is degraded because of noise, bandwidth filtering, or hearing impairment [1].
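The FLMP assumptions outlined in Section 1 can be made concrete with a minimal sketch of evaluation, integration, and decision for a bimodal (auditory/visual) stimulus. The function and the numerical truth values below are hypothetical illustrations, not results from our experiments or code from our system:

    # Minimal sketch of the FLMP decision rule (illustrative only).
    # Each source of information assigns a fuzzy truth value in [0, 1] to each
    # response alternative (evaluation).
    def flmp_identify(support_per_source):
        alternatives = support_per_source[0].keys()
        combined = {}
        for alt in alternatives:
            product = 1.0
            for source in support_per_source:
                product *= source[alt]   # independent sources integrated multiplicatively
            combined[alt] = product
        total = sum(combined.values())
        # Decision: relative goodness of each alternative among all alternatives.
        return {alt: value / total for alt, value in combined.items()}

    # Hypothetical bimodal trial: the auditory source weakly favors /ba/,
    # while the visible (lip-read) source strongly favors /da/.
    auditory = {"/ba/": 0.6, "/da/": 0.4}
    visual = {"/ba/": 0.1, "/da/": 0.9}
    print(flmp_identify([auditory, visual]))   # /da/ identified with probability of about .86

With these hypothetical values, the clear visible source dominates the ambiguous auditory one (0.4 x 0.9 = 0.36 for /da/ versus 0.6 x 0.1 = 0.06 for /ba/, giving a /da/ identification probability of 0.36/0.42, about .86), which is the qualitative pattern the model captures for bimodal speech.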
