Abstract

Detecting the vowel regions in a given speech signal has long been a challenging area of research. A number of works have been reported over the years aiming to accurately detect vowel regions and the corresponding vowel onset points (VOPs) and vowel end points (VEPs). The effectiveness of statistical acoustic modeling techniques and front-end signal processing approaches has been explored in this regard. The work presented in this paper aims at improving the detection of vowel regions as well as of the VOPs and VEPs. A number of statistical modeling approaches developed over the years are employed for this task. To this end, three-class classifiers (vowel, nonvowel, and silence) are developed on the TIMIT database using the different acoustic modeling techniques, and their classification performances are studied. Using any particular three-class classifier, a given speech sample is then force-aligned against the trained acoustic model under the constraints of a first-pass transcription to detect the vowel regions. The correctly detected and spurious vowel regions are analyzed in detail to determine the impact of semivowel and nasal sound units on the detection of vowel regions as well as on the determination of VOPs and VEPs. In addition, a novel front-end feature extraction technique exploiting the temporal and spectral characteristics of the excitation source information in the speech signal is proposed. The use of the proposed excitation source feature results in detected vowel regions that differ substantially from those obtained with mel-frequency cepstral coefficients. Exploiting the differences between the evidence obtained from the two kinds of features, a technique to combine the evidence is also proposed in order to obtain a better estimate of the VOPs and VEPs.
When the proposed techniques are evaluated on the vowel/nonvowel classification systems developed using the TIMIT database, significant improvements are noted. Moreover, the improvements hold across all the acoustic modeling paradigms explored in the presented work.
