Abstract

It is often acknowledged that speech signals contain short-term and long-term temporal properties [Rabiner, L., Juang, B.H., 1993. Fundamentals of Speech Recognition, Prentice-Hall, NJ, USA] that are difficult to capture and model with the usual fixed-scale (typically 20 ms) short-time spectral analysis used in hidden Markov models (HMMs), which rests on assumptions of piecewise stationarity and state-conditional independence of the acoustic vectors. For example, vowels are typically quasi-stationary over 40–80 ms segments, while plosives typically require analysis windows shorter than 20 ms. Fixed-scale analysis is therefore sub-optimal for the time–frequency resolution and modeling of the different stationary phones found in the speech signal. In the present paper, we investigate the potential advantages of using variable-size analysis windows for improving state-of-the-art speech recognition systems. Based on the usual assumption that the speech signal can be modeled by a time-varying autoregressive (AR) Gaussian process, we estimate the largest piecewise quasi-stationary speech segments, using the likelihood that a segment was generated by the same AR process. This likelihood is estimated from the linear prediction (LP) residual error. Each of these quasi-stationary segments is then used as an analysis window from which spectral features are extracted. The approach thus results in a variable-scale spectral analysis that adaptively estimates the largest analysis window over which the signal remains quasi-stationary, and hence the best tradeoff between temporal and frequency resolution. Speech recognition experiments on the OGI Numbers95 database [Cole, R.A., Fanty, M., Lander, T., 1994. Telephone speech corpus at CSLU. In: Proc. of ICSLP, Yokohama, Japan.] show that features based on the proposed variable-scale piecewise stationary spectral analysis yield improved recognition accuracy in clean conditions, compared both to features based on the minimum cross entropy spectrum [Loughlin, P., Pitton, J., Hannaford, B., 1994. Approximating time–frequency density functions via optimal combinations of spectrograms, IEEE Signal Process. Lett. 1 (12)] and to features based on fixed-scale spectral analysis.
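
The segmentation idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the exact likelihood criterion, so the code below assumes a Gaussian log-likelihood ratio computed from LP residual energies (one AR model for two adjacent chunks versus one model per chunk) and a simple greedy window-growing rule. The function names (`lp_residual_energy`, `glr_distance`, `segment_boundaries`, `synth_ar2`), the least-squares LP fit, the AR order, and the values of `min_len`, `step` and `threshold` are all hypothetical choices made for illustration.

```python
import numpy as np

def lp_residual_energy(x, order=4):
    """Mean squared LP residual of x under a least-squares AR(order) fit."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Column k holds the sample lagged by k+1, aligned with targets x[order:].
    X = np.column_stack([x[order - k - 1 : n - k - 1] for k in range(order)])
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    return float(np.mean(resid ** 2)) + 1e-12  # small floor avoids log(0)

def glr_distance(left, right, order=4):
    """Assumed criterion: Gaussian log-likelihood ratio between
    'one AR model for both chunks' and 'a separate AR model per chunk'."""
    both = np.concatenate([left, right])
    s0 = lp_residual_energy(both, order)
    s1 = lp_residual_energy(left, order)
    s2 = lp_residual_energy(right, order)
    return 0.5 * (len(both) * np.log(s0)
                  - len(left) * np.log(s1)
                  - len(right) * np.log(s2))

def segment_boundaries(x, order=4, min_len=80, step=20, threshold=25.0):
    """Greedy sketch: enlarge the current window until the next chunk
    is unlikely to come from the same AR process, then start a new window."""
    boundaries = [0]
    start, end = 0, min_len
    while end + min_len <= len(x):
        d = glr_distance(x[start:end], x[end:end + min_len], order)
        if d > threshold:
            boundaries.append(end)            # change detected: close segment
            start, end = end, end + min_len
        else:
            end += step                       # still quasi-stationary: grow
    boundaries.append(len(x))
    return boundaries

def synth_ar2(theta, n, rng, radius=0.95):
    """Synthetic AR(2) test signal with a resonance at angle theta (radians)."""
    a1, a2 = 2 * radius * np.cos(theta), -radius ** 2
    e = rng.standard_normal(n + 2)
    x = np.zeros(n + 2)
    for t in range(2, n + 2):
        x[t] = a1 * x[t - 1] + a2 * x[t - 2] + e[t]
    return x[2:]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two quasi-stationary pieces with different resonances, joined at sample 400.
    signal = np.concatenate([synth_ar2(0.3, 400, rng), synth_ar2(2.0, 400, rng)])
    print(segment_boundaries(signal))
```

In a recognition front end, each pair of consecutive boundaries would delimit one analysis window from which spectral features are computed. The threshold trades segment length against sensitivity to change and would need tuning on real speech.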
