Abstract

We investigate speech recognition features related to voicing, that is, features that indicate whether the vocal folds are vibrating. We describe two voicing features, periodicity and jitter, and demonstrate that they are powerful voicing discriminators. The periodicity and jitter features, together with their first and second time derivatives, are appended to a standard 38-dimensional feature vector comprising the first and second time derivatives of the frame energy and the cepstral coefficients with their first and second time derivatives. HMM-based connected-digit (CD) and large-vocabulary (LV) recognition experiments comparing the traditional and extended feature sets show that voicing features and spectral information are complementary, and that combining the two sources of information improves speech recognition performance. We further conclude that the difference in performance with and without the voicing features is more significant under minimum string error (MSE) training than under maximum likelihood (ML) training.
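
The feature extension described above can be illustrated with a minimal sketch. It assumes the periodicity and jitter values have already been computed for each frame, and it uses a simple central difference for the time derivatives. The abstract does not specify the derivative scheme, so that choice, along with the function names `delta` and `extend_features`, is hypothetical. Appending two voicing features and their first and second derivatives to the 38-dimensional vector gives 38 + 6 = 44 dimensions per frame.

```python
from typing import List, Sequence


def delta(track: Sequence[float]) -> List[float]:
    """Time derivative of a per-frame track.

    Assumption: a central difference with the edge frames repeated.
    The abstract does not say which derivative scheme is used.
    """
    n = len(track)
    out = []
    for t in range(n):
        prev_val = track[max(t - 1, 0)]
        next_val = track[min(t + 1, n - 1)]
        out.append((next_val - prev_val) / 2.0)
    return out


def extend_features(
    base: Sequence[Sequence[float]],
    periodicity: Sequence[float],
    jitter: Sequence[float],
) -> List[List[float]]:
    """Append periodicity, jitter, and their first and second time
    derivatives to each base frame (38 -> 44 dimensions per frame)."""
    if not (len(base) == len(periodicity) == len(jitter)):
        raise ValueError("all inputs must cover the same number of frames")

    d_per = delta(periodicity)   # first derivative of periodicity
    dd_per = delta(d_per)        # second derivative of periodicity
    d_jit = delta(jitter)        # first derivative of jitter
    dd_jit = delta(d_jit)        # second derivative of jitter

    extended = []
    for t, frame in enumerate(base):
        voicing = [
            periodicity[t], d_per[t], dd_per[t],
            jitter[t], d_jit[t], dd_jit[t],
        ]
        extended.append(list(frame) + voicing)
    return extended
```

Calling `extend_features` with a list of 38-dimensional frames and two equally long voicing tracks returns one 44-dimensional vector per frame, with the six voicing entries placed after the original features. The ordering of the appended entries is an illustrative choice, not something the abstract states.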
