Abstract
Articulatory speech recognizers infer the configuration and movement of a speaker’s vocal apparatus as an intermediate representation between acoustics and phonetics. Articulatory modeling makes coarticulation processes explicit, in contrast to the conventional triphone technique of enumerating all coarticulation combinations. In the present study, hierarchical, interpolating hidden Markov models (HMMs) transform acoustics into articulation and articulation into phonetics. Articulatory and acoustic data with transcriptions are used for both training and testing. The articulatory data, collected by the University of Wisconsin, consist of synchronized recordings during speech of X-ray microbeam positions, throat vibration, and acoustics. State variables of the Markov model represent articulator configurations; these variables change continuously between targets to represent coarticulation, and their changes are driven by a top-level state variable that changes from phone to phone. Viewed as a generator, the model emits both articulator positions and acoustic analysis coefficients; in recognition, these emissions are the inputs. Variational techniques make the training problem tractable. The latest phonetic-recognition results of the model will be contrasted with baseline HMM systems trained on acoustic and articulatory data. [Work supported in part by NIMH, the NSF/DARPA STC for Computer Graphics and Scientific Visualization, and Hewlett-Packard.]
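To make the generative view of the model concrete, the following is a minimal sketch, not the authors' implementation, of the hierarchy the abstract describes: a top-level phone state drives articulatory targets, the hidden articulator state interpolates continuously toward those targets (coarticulation), and the model emits noisy articulator positions and acoustic analysis coefficients. All names, dimensions, transition probabilities, and parameter values below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

N_PHONES = 5       # size of the phone inventory (assumed)
D_ARTIC = 8        # articulator-state dimension, e.g. microbeam x/y positions (assumed)
D_ACOUSTIC = 12    # acoustic analysis coefficients per frame (assumed)

phone_targets = rng.normal(size=(N_PHONES, D_ARTIC))            # per-phone articulatory targets
phone_trans = np.full((N_PHONES, N_PHONES), 1.0 / N_PHONES)     # top-level phone transition matrix
C = rng.normal(scale=0.1, size=(D_ACOUSTIC, D_ARTIC))           # articulation-to-acoustics map

def generate(n_frames, alpha=0.3, artic_noise=0.05, acoustic_noise=0.1):
    """Sample (phone labels, articulator positions, acoustic frames) from the sketch model."""
    phone = rng.integers(N_PHONES)
    artic = phone_targets[phone].copy()
    phones, artics, acoustics = [], [], []
    for _ in range(n_frames):
        # Top-level state variable: changes from phone to phone.
        phone = rng.choice(N_PHONES, p=phone_trans[phone])
        # Articulator state moves a fraction alpha toward the current phone's
        # target each frame, so it changes continuously between targets.
        artic = artic + alpha * (phone_targets[phone] - artic)
        # Emissions: noisy articulator positions and acoustic coefficients.
        artic_obs = artic + rng.normal(scale=artic_noise, size=D_ARTIC)
        acoustic_obs = C @ artic + rng.normal(scale=acoustic_noise, size=D_ACOUSTIC)
        phones.append(phone)
        artics.append(artic_obs)
        acoustics.append(acoustic_obs)
    return np.array(phones), np.array(artics), np.array(acoustics)

if __name__ == "__main__":
    ph, ar, ac = generate(50)
    print(ph[:10], ar.shape, ac.shape)

In recognition the direction is reversed: the articulatory and acoustic emissions become the observed inputs, and the hidden articulator and phone states are inferred, which is where the variational training techniques mentioned in the abstract come in.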