It has been shown that face (lips, cheeks, and chin) information can account to a large extent for visual speech perception in isolated syllables and words. Visual speech synthesis has used reduced sets of phoneme categories ("visemes"), on the theory that perceivers are limited in their ability to extract visual speech information. In this study, lip configurations from a manually segmented sentence database [L. Bernstein et al., J. Acoust. Soc. Am. 107, 2887 (2000)] were analyzed to derive phoneme clusters that are algorithmically distinguishable using mouth vertical opening, mouth horizontal opening, and lip protrusion measured at the midpoint of each segment. The lip-feature sample space for each phoneme was modeled with a Gaussian mixture model, and maximum posterior probability classification results were computed for each phoneme. Confusion matrices were generated from the classification results, and a set of phonemes with 74% or higher within-group classification accuracy was judged to form a cluster. Preliminary results from 191 sentences spoken by a single talker yielded the following clusters: {/p, b, m/ (77%), /f, v/ (74%), /w, r/ (80%), /t, d, s, z, D, k, n/ (88%)}. We will present results analyzing the entire English phoneme set across different talkers and compare them with visual perceptual clusters. [Work supported in part by the NSF.]
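The pipeline described above can be illustrated with a minimal sketch (not the authors' code): one Gaussian mixture model per phoneme over three lip features (vertical opening, horizontal opening, protrusion), maximum posterior probability classification, a confusion matrix, and the 74% within-group criterion for accepting a cluster. The phoneme subset, feature values, equal priors, and number of mixture components are illustrative assumptions.

```python
# Hedged sketch of GMM-based phoneme clustering from lip features.
# Synthetic data stands in for the mid-segment measurements in the abstract.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
phonemes = ["p", "b", "m", "f", "v"]  # assumed subset of the phoneme inventory

def fake_samples(center, n=200):
    """Stand-in for mid-segment lip features:
    columns = [vertical opening, horizontal opening, protrusion] (arbitrary units)."""
    return center + rng.normal(scale=0.3, size=(n, 3))

train = {p: fake_samples(rng.uniform(0, 3, size=3)) for p in phonemes}

# One GMM per phoneme models that phoneme's lip-feature sample space.
models = {p: GaussianMixture(n_components=2, random_state=0).fit(X)
          for p, X in train.items()}
log_prior = np.log(1.0 / len(phonemes))  # assume equal phoneme priors

def classify(x):
    """Maximum posterior probability decision over the per-phoneme GMMs."""
    scores = {p: m.score_samples(x[None, :])[0] + log_prior
              for p, m in models.items()}
    return max(scores, key=scores.get)

# Row-normalized confusion matrix from fresh (synthetic) test samples.
conf = np.zeros((len(phonemes), len(phonemes)))
for i, p in enumerate(phonemes):
    for x in fake_samples(train[p].mean(axis=0), n=100):
        conf[i, phonemes.index(classify(x))] += 1
conf /= conf.sum(axis=1, keepdims=True)

def within_group_correct(group):
    """Proportion of a group's tokens classified as any member of the group;
    a group is accepted as a cluster if this is >= 0.74 (the abstract's criterion)."""
    idx = [phonemes.index(p) for p in group]
    return conf[np.ix_(idx, idx)].sum() / len(idx)

print(within_group_correct(["p", "b", "m"]))  # e.g., accept if >= 0.74
```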