Abstract
This paper presents an algorithm for formant tracking using HMMs and analyzes the influence of HMM initialization, training, and context dependency on the accuracy of the resulting formant tracks. Formant trackers usually comprise two phases: one in which the speech is analyzed and formant candidates are obtained, and another in which, by imposing different constraints, the most likely formants are chosen. While the first stage usually relies on standard spectrum estimation techniques, the second stage has evolved notably in recent years. Traditionally, the second phase imposes continuity constraints on the formant selection process. More recently, research has focused on incorporating phonemic knowledge into this stage to make formant tracking more reliable. To incorporate phonemic knowledge, newer approaches use the orthographic transcription of the speech utterance: the phonemic transcription is derived from the orthographic one, and from it and the speech signal a phonemic segmentation is obtained. This segmentation, together with the phonemic transcription and knowledge of the nominal formant positions for the different phonemes, provides extra information that can be used to produce more accurate formant tracks. This paper presents a complete HMM-based, data-driven formant tracking algorithm suitable for combining different levels of acoustic and phonemic information. A detailed performance analysis of the algorithm is presented for different initialization strategies using different levels of knowledge, different degrees of training, and both context-independent and context-dependent HMMs. Speaker-dependent experimental results show that the efficient use of phonemic information in HMM training and context-dependent modeling significantly reduces the formant tracking error rate, especially for formants $F_2$ and $F_3$.
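The paper's HMM algorithm itself is not reproduced in the abstract. As a point of reference, the following Python sketch only illustrates the generic two-phase structure described above: phase one extracts per-frame formant candidates from the angles of LPC roots, and phase two selects one candidate per frame with a Viterbi-style search that trades closeness to a nominal frequency against frame-to-frame continuity. All function names, parameter values (frame length, LPC order, the continuity weight `lam`, the nominal frequency), and the synthetic test signal are illustrative assumptions, not the paper's method.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter


def lpc_coeffs(frame, order):
    """LPC coefficients via the autocorrelation (Yule-Walker) method."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    a = np.linalg.solve(toeplitz(r[:order]), -r[1 : order + 1])
    return np.concatenate(([1.0], a))


def formant_candidates(signal, fs, frame_len=0.025, hop=0.010, order=12):
    """Phase 1: per-frame formant candidates from the angles of LPC roots."""
    n, h = int(frame_len * fs), int(hop * fs)
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    win = np.hamming(n)
    cands = []
    for start in range(0, len(emph) - n, h):
        a = lpc_coeffs(emph[start : start + n] * win, order)
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]  # keep one root per conjugate pair
        freqs = np.angle(roots) * fs / (2.0 * np.pi)
        cands.append(np.sort(freqs[(freqs > 90) & (freqs < fs / 2 - 90)]))
    return cands


def track_one_formant(cands, nominal, lam=0.5):
    """Phase 2: Viterbi-style search balancing closeness to a nominal
    frequency (local cost) against frame-to-frame jumps (transition cost).
    Assumes every frame yields at least one candidate."""
    acc = np.abs(cands[0] - nominal)  # accumulated cost per candidate
    back = []
    for t in range(1, len(cands)):
        trans = lam * np.abs(cands[t][:, None] - cands[t - 1][None, :])
        scores = acc[None, :] + trans  # (current, previous) path costs
        back.append(scores.argmin(axis=1))
        acc = np.abs(cands[t] - nominal) + scores.min(axis=1)
    j = int(acc.argmin())  # backtrack the cheapest path
    path = [j]
    for bp in reversed(back):
        j = int(bp[j])
        path.append(j)
    path.reverse()
    return np.array([cands[t][j] for t, j in enumerate(path)])


if __name__ == "__main__":
    fs = 16000
    # Synthetic "vowel": white noise through resonators near typical F1-F3.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(fs)  # one second of noise
    for f, bw in [(500, 60), (1500, 90), (2500, 120)]:
        r = np.exp(-np.pi * bw / fs)
        th = 2.0 * np.pi * f / fs
        x = lfilter([1.0], [1.0, -2.0 * r * np.cos(th), r * r], x)
    f2 = track_one_formant(formant_candidates(x, fs), nominal=1500.0)
    print(f"mean tracked F2: {f2.mean():.0f} Hz")  # should land near 1500 Hz
```

In the paper's data-driven approach, the hand-tuned nominal-frequency and continuity costs of such a sketch are effectively replaced by trained, phoneme-dependent HMM parameters informed by the phonemic transcription and segmentation.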