Abstract

This study examines relationships between external face movements, tongue movements, and speech acoustics for consonant-vowel (CV) syllables and sentences spoken by two male and two female talkers with different visual intelligibility ratings. The questions addressed are how relationships among the measures vary by syllable, whether talkers who are more intelligible produce greater optical evidence of tongue movements, and how the results for CVs compare to those for sentences. Results show that the prediction of one data stream from another is better for C/a/ syllables than for C/i/ and C/u/ syllables. Across the different places of articulation, lingual places yield better predictions of one data stream from another than do bilabial and glottal places. Results vary from talker to talker; interestingly, highly rated intelligibility does not result in better predictions. In general, predictions for CV syllables are better than those for sentences.

Highlights

  • For electromagnetic midsagittal articulography (EMA) data predicted from OPT data, the tongue tip pellet (TT) was better predicted than tongue back (TB) and tongue middle (TM)

  • For EMA data predicted from line spectral pairs (LSPs), the tongue tip (TT) was the worst predicted of the three tongue pellets, as was also found for CVs

Introduction

The effort to create talking machines began several hundred years ago [1, 2], and over the years most speech synthesis efforts have focused mainly on speech acoustics. With the development of computer technology, the desire to create talking faces along with voices has been spurred by many potential applications. A better understanding of the relationships between speech acoustics and face and tongue movements would help in developing better synthetic talking faces [2], as well as in other applications. How best to drive a synthetic talking face is a challenging question. A theoretically ideal driving source for face animation is speech acoustics, because the optical and acoustic signals are simultaneous products of speech production. Speech production involves control of various speech articulators to produce acoustic speech signals. Predictable relationships between articulatory movements and speech acoustics are therefore expected, and many researchers have studied such articulatory-to-acoustic relationships (e.g., [6, 7, 8, 9, 10, 11, 12, 13, 14]).
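
As a minimal illustration of what "predicting one data stream from another" can mean in this context, the sketch below fits an ordinary least-squares linear map from one time-aligned feature stream (e.g., acoustic line spectral pairs) to another (e.g., EMA tongue-pellet coordinates) and scores it by per-channel correlation. This is a hedged sketch using synthetic data; the estimation method, feature dimensions, and variable names here are assumptions for illustration, not the study's actual procedure.

```python
import numpy as np

def fit_linear_map(X, Y):
    """Fit a least-squares linear map Y ~ X @ W (with a bias column).

    X: (n_frames, n_input_features)  e.g., optical marker or LSP features
    Y: (n_frames, n_output_features) e.g., EMA tongue-pellet coordinates
    Both streams are assumed to be time-aligned and equally sampled.
    """
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])      # append bias column
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)          # shape (n_in + 1, n_out)
    return W

def predict(W, X):
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    return X1 @ W

def per_channel_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted trajectories,
    computed separately for each output channel (e.g., each pellet coordinate)."""
    return np.array([np.corrcoef(Y_true[:, i], Y_pred[:, i])[0, 1]
                     for i in range(Y_true.shape[1])])

# Hypothetical example: predict 6 EMA channels (3 pellets x 2 coordinates)
# from 16 acoustic LSP coefficients over 500 aligned frames of synthetic data.
rng = np.random.default_rng(0)
lsp = rng.standard_normal((500, 16))
ema = lsp @ rng.standard_normal((16, 6)) + 0.1 * rng.standard_normal((500, 6))

W = fit_linear_map(lsp[:400], ema[:400])               # train on first 400 frames
r = per_channel_correlation(ema[400:], predict(W, lsp[400:]))
print("per-channel correlations:", np.round(r, 3))
```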
