Abstract

When the acoustic speech signal is degraded by noise, compensatory information can often be obtained by watching the speaker's mouth [W. H. Sumby and I. Pollack, J. Acoust. Soc. Am. 26, 212–215 (1954)]. The aim of the work to be presented is to explore the structure of this compensatory information as it might be used to assist automatic speech recognition. Starting with visual images taken from laser disc recordings of speakers' faces [Bernstein et al., J. Acoust. Soc. Am. Suppl. 1 82, S22 (1987)], a small box centered on the mouth is identified and extracted. A variety of neural networks are then trained to process these images under two interpretation schemes. The first approach classifies each visual image into one of a set of predefined classes. The second approach extracts acoustic constraints directly from the image: toward this end, neural networks are trained to estimate the short-term power spectral envelope of the acoustic signal given the corresponding visual image as input. Performance is evaluated on nine vowels obtained from two speakers. These results will be reported and compared with results from more traditional classification and estimation techniques. [Work supported by AFOSR Contract No. 86-0246.]
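To make the second interpretation scheme concrete, the following is a minimal sketch of a regression network mapping a flattened mouth-region image to an estimated spectral envelope. The abstract does not specify the network architecture, image size, number of spectral channels, or training procedure; every size, name (e.g., `estimate_envelope`, `train_step`), and the use of a one-hidden-layer tanh network trained by gradient descent on squared error is an illustrative assumption, not a description of the original study.

```python
# Sketch: estimating a short-term power spectral envelope from a mouth-region
# image with a small feedforward network. All sizes, names, and training
# details are illustrative assumptions, not taken from the original work.
import numpy as np

rng = np.random.default_rng(0)

IMG_PIXELS = 20 * 25     # assumed size of the extracted mouth box, flattened
N_CHANNELS = 16          # assumed number of spectral-envelope channels
N_HIDDEN = 32            # assumed hidden-layer width

# One-hidden-layer network: tanh hidden units, linear output (generic regression).
W1 = rng.normal(scale=0.01, size=(IMG_PIXELS, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.01, size=(N_HIDDEN, N_CHANNELS))
b2 = np.zeros(N_CHANNELS)

def estimate_envelope(image_vec):
    """Map a flattened mouth image to an estimated log-power envelope."""
    h = np.tanh(image_vec @ W1 + b1)
    return h @ W2 + b2

def train_step(images, envelopes, lr=1e-3):
    """One batch of gradient descent on the squared envelope-estimation error
    (constant factors of the gradient are folded into the learning rate)."""
    global W1, b1, W2, b2
    h = np.tanh(images @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - envelopes                      # shape (batch, N_CHANNELS)
    # Backpropagate through the linear output layer and the tanh hidden layer.
    dW2 = h.T @ err / len(images)
    db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)          # tanh derivative
    dW1 = images.T @ dh / len(images)
    db1 = dh.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    return float((err ** 2).mean())

# Synthetic stand-in data, only to show usage; real training pairs would be
# extracted mouth images and measured spectral envelopes of the same frames.
images = rng.random((100, IMG_PIXELS))
envelopes = rng.random((100, N_CHANNELS))
for epoch in range(200):
    loss = train_step(images, envelopes)
print("final training MSE:", loss)
```

The classification scheme differs only in the output layer: instead of a linear output trained on squared error, the same kind of network would end in a set of class scores trained against category labels.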
