Abstract

Most current state-of-the-art speech recognition systems are based on speech signal parametrizations that crudely model the behavior of the human auditory system. However, little or no use is usually made of knowledge of the human speech production system. A data-driven statistical approach to incorporating this knowledge into ASR would require a substantial amount of data, which are not widely available, since their acquisition is difficult and expensive. Furthermore, during recognition it is nearly impossible to obtain observations of articulator movements. Thus, research on speech production mechanisms in ASR has largely focused on modeling the hidden articulatory trajectories and on using prior phonetic and phonological knowledge. Nevertheless, it has been shown that combining acoustic and articulatory information can lead to improved speech recognition performance. The approach taken in this study is to integrate features extracted from actual articulatory data with acoustic MFCC features in a way that allows recognition using MFCCs only. Rather than trying to map articulatory features to the corresponding acoustic features, we use the probabilistic dependency between them. Bayesian Networks (BN) are ideally suited for this purpose: they can model complex joint probability distributions with many discrete and continuous variables, and they offer great flexibility in representing the dependencies among them. Our speech recognition system is based on the hybrid HMM/BN acoustic model, where the BN describes the probability distributions of the HMM states, while the HMM transitions model the temporal characteristics of speech. Articulatory and acoustic features are represented by different variables of the BN, and their dependencies are learned from the observable articulatory and acoustic training data. During recognition, when only the acoustic observations are available, the articulatory variables are treated as hidden. We evaluated our ASR system on a small database of articulatory and acoustic data recorded from three speakers; the articulatory data are actual measurements of articulator positions at several points. In all experiments, with both speaker-dependent and multi-speaker acoustic models, the HMM/BN system outperformed the baseline HMM system trained on acoustic data only. In experimenting with different BN topologies, we found that integrating velocity and acceleration coefficients, calculated as first and second derivatives of the articulatory position data, can further improve recognition performance.
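The abstract describes the hybrid HMM/BN model only at a high level. As a rough illustration of the core idea, the minimal sketch below (our own hypothetical code, not the authors' implementation; the function names, parameter layout, and the choice of a single discrete articulatory variable with one Gaussian per value are all assumptions) shows how the articulatory variable, learned jointly with the acoustics during training, is marginalized out at recognition time when only MFCCs are observed, and how velocity and acceleration coefficients can be obtained as first and second derivatives of the articulator position trajectories:

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian N(mean, diag(var))."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def state_log_likelihood(x_mfcc, state):
    """HMM-state observation log-likelihood with the articulatory
    variable 'a' hidden, as during recognition:

        p(x | q) = sum_a P(a | q) * N(x; mean[a], var[a])

    'state' bundles parameters that would be estimated during training
    from the jointly observed acoustic and articulatory data
    (hypothetical layout):
        state["prior"][a]                  -- P(a | q)
        state["mean"][a], state["var"][a]  -- Gaussian of p(x | a, q)
    """
    log_terms = np.array([
        np.log(state["prior"][a])
        + gaussian_logpdf(x_mfcc, state["mean"][a], state["var"][a])
        for a in range(len(state["prior"]))
    ])
    m = log_terms.max()  # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_terms - m).sum())

def add_articulatory_derivatives(positions):
    """Append velocity and acceleration coefficients, computed as first
    and second finite-difference derivatives of the articulator position
    trajectories (array of shape: frames x coordinates)."""
    velocity = np.gradient(positions, axis=0)
    acceleration = np.gradient(velocity, axis=0)
    return np.concatenate([positions, velocity, acceleration], axis=1)
```

Note that with one Gaussian per value of the hidden articulatory variable, marginalizing that variable reduces the state distribution to a Gaussian mixture over the acoustic features, which is why a model trained on joint acoustic-articulatory data can still be used for recognition from MFCCs alone.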
