Abstract

The improvement in the auditory detectability of spoken sentences provided by visible speech cues, reported by Grant and Seitz [2000. The use of visible speech cues for improving auditory detection of spoken sentences. JASA 108, 1197–1208], has been related to the degree of correlation between acoustic envelopes and visible movements. This suggests that audio and visual signals could interact early in the audio-visual perceptual process on the basis of audio envelope cues. Moreover, acoustic-visual correlations were previously reported by Yehia et al. [1998. Quantitative association of vocal tract and facial behavior. Speech Commun. 26 (1), 23–43]. Taking these two findings into account, the problem of extracting the redundant audio-visual components is revisited: a video parametrization of natural images and three types of audio parameters are tested together, leading to new and realistic applications in video synthesis and audio-visual speech enhancement. Consistent with Grant and Seitz’s prediction, the 4-subband envelope energy features are found to be optimal for encoding the redundant components available for the enhancement task. The proposed computational model of audio-visual interaction is based on the product, in the audio pathway, between the time-aligned audio envelopes and the video-predicted envelopes. This interaction scheme is shown to be phonetically neutral, so it does not bias phonetic identification. The low-level stage described here is compatible with a late integration process and could serve as a front-end for speech recognition applications.

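To make the interaction stage concrete, the following is a minimal sketch (in Python, using NumPy/SciPy) of the processing chain the abstract describes: 4-subband envelope energies extracted from the audio, a linear mapping that predicts those envelopes from video parameters, and the multiplicative audio-visual interaction applied in the audio pathway. The band edges, the Hilbert-envelope extraction, the frame length, and the simple least-squares regression are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

# Hypothetical band edges (Hz) for the 4 subbands; the paper's exact
# cut-off frequencies are an assumption here.
BAND_EDGES = [(100, 800), (800, 2200), (2200, 4500), (4500, 7500)]

def subband_envelopes(audio, fs, frame_len):
    """Frame-averaged envelope energy in each of the 4 subbands."""
    envs = []
    for lo, hi in BAND_EDGES:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, audio)
        env = np.abs(hilbert(band))                    # Hilbert envelope (assumed)
        # Average energy per video-rate frame so audio and video are time-aligned.
        n_frames = len(env) // frame_len
        env = env[: n_frames * frame_len].reshape(n_frames, frame_len)
        envs.append((env ** 2).mean(axis=1))
    return np.stack(envs, axis=1)                      # shape: (n_frames, 4)

def fit_video_to_envelopes(video_params, envelopes):
    """Least-squares mapping from video parameters (sampled at the same
    frame rate) to the 4 subband envelopes: the redundant A-V component."""
    X = np.hstack([video_params, np.ones((len(video_params), 1))])
    W, *_ = np.linalg.lstsq(X, envelopes, rcond=None)
    return W

def predict_envelopes(video_params, W):
    """Video-predicted subband envelopes, clipped to stay non-negative."""
    X = np.hstack([video_params, np.ones((len(video_params), 1))])
    return np.clip(X @ W, 0.0, None)

def audio_visual_product(audio_envs, video_pred_envs):
    """Interaction stage: product, in the audio pathway, of the time-aligned
    audio envelopes and the video-predicted envelopes."""
    return audio_envs * video_pred_envs
```

The product can then be used to reweight the noisy subband signals for enhancement, or passed on as features to a later (late-integration) recognition stage; both uses are consistent with the front-end role sketched in the abstract.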