Abstract

It has been shown that face (lips, cheeks, and chin) information can account to a large extent for visual speech perception of isolated syllables and words. Visual speech synthesis has used reduced sets of phoneme classes (''visemes''), under the theory that perceivers are limited in their ability to extract visual speech information. In this study, lip configurations from a manually segmented sentence database [L. Bernstein et al., J. Acoust. Soc. Am. 107, 2887 (2000)] were analyzed to identify phoneme clusters that are algorithmically distinguishable using mouth vertical/horizontal opening and lip protrusion measured at the midpoint of each segment. The lip-feature sample space for each phoneme was represented by a Gaussian mixture model. Maximum posterior probability classification results were computed for each phoneme. Confusion matrices were generated from the classification results, and a set of confused phonemes with within-group correct classification of 74% or higher was judged to be a cluster. Preliminary results from 191 sentences by a single talker generated the following clusters: {/p, b, m/ (77%), /f, v/ (74%), /w, r/ (80%), /t, d, s, z, D, k, n/ (88%)}. We will present results analyzing the entire English phoneme set across different talkers and compare the results with visual perceptual clusters. [Work supported in part by the NSF.]
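The classification pipeline described above (per-phoneme Gaussian models over lip features, maximum-posterior classification, and a confusion matrix) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature values are synthetic, only three hypothetical phoneme classes are used, and each class is modeled by a single Gaussian rather than a full mixture, with uniform priors so maximum posterior reduces to maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lip-feature samples (vertical opening, horizontal opening,
# protrusion) for three illustrative phoneme classes; real data would come
# from measured mid-segment lip configurations.
means = {"p": [2.0, 8.0, 1.0], "f": [3.0, 7.0, 2.5], "w": [4.0, 4.0, 4.0]}
samples = {ph: rng.normal(mu, 0.6, size=(200, 3)) for ph, mu in means.items()}

def fit_gaussian(x):
    """Fit a single Gaussian (a one-component 'mixture') to feature samples."""
    return x.mean(axis=0), np.cov(x, rowvar=False)

def log_likelihood(x, mu, cov):
    """Log density of a multivariate normal evaluated at x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.inv(cov) @ d + logdet + len(mu) * np.log(2 * np.pi))

models = {ph: fit_gaussian(x) for ph, x in samples.items()}
phones = list(models)

# Maximum-posterior classification (uniform priors, so max likelihood wins)
# and the resulting confusion matrix: rows are true phonemes, columns the
# classifier's choices.
conf = np.zeros((len(phones), len(phones)), dtype=int)
for i, ph in enumerate(phones):
    for x in samples[ph]:
        scores = [log_likelihood(x, *models[q]) for q in phones]
        conf[i, np.argmax(scores)] += 1

print(conf)
```

From such a matrix, phonemes whose confusions concentrate within a group (here, a within-group correct-classification rate of 74% or higher) would be merged into a single viseme cluster.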
