Abstract

Knowledge of the phonemes and visemes of a language is a vital component of speech-based applications. A phoneme is the basic sound unit needed to represent all the words of a particular language. A viseme, as currently defined, is a visual unit of speech that describes the state of the different speech articulators. This chapter addresses the primary task of identifying visemes and the number of frames required to encode the temporal evolution of vowel and consonant phonemes. For this work, an audio-visual Malayalam speech database was created from 23 native speakers from Kerala (18 female and 5 male). The flexibility and speed of the tongue play a vital role in Malayalam utterances, which distinguishes the language from others. The appearance of the teeth and oral cavity and the shape of the lips can be modeled using geometric lip features obtained from the hue, saturation, value (HSV) color space, while deformation in the appearance of the lips and tongue can be modeled using discrete cosine transform (DCT) features. A linguistically involved, data-driven approach combines the modeling of individual perception offered by a linguistic approach with the computational ease of a data-driven approach. The visual speech features are then clustered using K-means clustering and the Gap statistic to identify the visual equivalent of each phoneme. To study the temporal variation, we analyzed three phoneme-to-viseme mappings and compared them with the linguistic mapping and visual speech duration.
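
The clustering step named above can be illustrated with a minimal sketch. It is not the chapter's implementation: the synthetic 2-D points stand in for the HSV and DCT lip features, and the function names (`kmeans_dispersion`, `best_dispersion`, `gap_statistic`), the number of reference sets, and the selection rule are all assumptions. The rule used is the commonly cited one from Tibshirani, Walther, and Hastie: pick the smallest k whose gap is at least the next gap minus its standard error.

```python
import math
import random


def kmeans_dispersion(points, k, rng, iters=50):
    """Run plain Lloyd's K-means once; return the within-cluster sum of squares."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def best_dispersion(points, k, rng, restarts=5):
    """Keep the lowest dispersion over several random initialisations."""
    return min(kmeans_dispersion(points, k, rng) for _ in range(restarts))


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (chosen k, list of gap values for k = 1..k_max)."""
    rng = random.Random(seed)
    lows = [min(axis) for axis in zip(*points)]
    highs = [max(axis) for axis in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(best_dispersion(points, k, rng))
        ref_logs = []
        for _ in range(n_refs):
            # Reference set: uniform samples inside the data's bounding box.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(best_dispersion(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    # Smallest k with Gap(k) >= Gap(k+1) - s(k+1); fall back to k_max.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


# Hypothetical stand-in data: three well-separated groups of 2-D feature vectors.
data_rng = random.Random(1)
blobs = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(0.0, 0.5))
    for cx in (0.0, 4.0, 20.0)
    for _ in range(30)
]
best_k, gaps = gap_statistic(blobs, k_max=5)
print(best_k, [round(g, 2) for g in gaps])
```

In a viseme setting, each point would instead be a per-phoneme feature vector, and the chosen k would be read as the number of viseme classes. A pure standard-library K-means is used here only to keep the sketch self-contained.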
