Abstract

In the present work we review some recently proposed discrete Fourier transform (DFT)- and discrete wavelet packet transform (DWPT)-based speech parameterization methods and evaluate their performance on the speech recognition task. Specifically, to assess the practical value of these less studied methods, we evaluate them in a common experimental setup and compare their performance against traditional techniques, such as the Mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) cepstral coefficients, which presently dominate the speech recognition field. In particular, using the well-established TIMIT speech corpus and the Sphinx-III speech recognizer, we present comparative results for eight speech parameterization techniques.

Highlights

  • Contemporary speech recognition technology is based on the statistical analysis of speech, performed through powerful pattern recognition techniques such as hidden Markov models (HMM)[1] and dynamic programming procedures such as the Viterbi algorithm[2].

  • In contrast to the subband-based cepstral coefficients (SBC) and WPF F&D speech features, which are based on the Mel scale, the formulation of the WPSR wavelet packet features exploited the suitability of the various wavelet packet orthonormal transforms for approximating the psychoacoustic effect explained by the critical bands concept, which was introduced by Fletcher[23].

  • The number 16 in the brackets after the designation of the wavelet packet-based speech features denotes that these features use only the first 16 milliseconds of the speech frame. This is forced by the requirement of the discrete wavelet packet transform (DWPT) analysis that the number of input samples be an exact power of 2.
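
The power-of-2 constraint in the last highlight can be illustrated with a little arithmetic. The sketch below assumes the 16 kHz sampling rate of the TIMIT recordings and a hypothetical 25 ms analysis frame (the frame length is not given here, and the helper name is illustrative only). Under those assumptions, the largest power-of-2 prefix of the frame is 256 samples, which corresponds to 16 ms.

```python
def largest_power_of_two_prefix(frame):
    """Return the leading samples of `frame`, truncated to the largest
    power-of-2 length that does not exceed len(frame)."""
    if len(frame) == 0:
        raise ValueError("frame must contain at least one sample")
    # bit_length() - 1 is floor(log2(len(frame))) for positive integers
    n = 1 << (len(frame).bit_length() - 1)
    return frame[:n]


SAMPLE_RATE_HZ = 16_000  # TIMIT sampling rate
FRAME_MS = 25            # ASSUMPTION: hypothetical frame length, not from the text

# Stand-in for one speech frame: 400 sample values at 16 kHz
frame = list(range(SAMPLE_RATE_HZ * FRAME_MS // 1000))

dwpt_input = largest_power_of_two_prefix(frame)
duration_ms = 1000 * len(dwpt_input) / SAMPLE_RATE_HZ

print(len(frame), len(dwpt_input), duration_ms)
```

Any frame between 256 and 511 samples would be truncated to the same 256-sample prefix, so the result does not depend strongly on the assumed frame length.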


Summary

INTRODUCTION

Contemporary speech recognition technology is based on the statistical analysis of speech, performed through powerful pattern recognition techniques such as hidden Markov models (HMM)[1] and dynamic programming procedures such as the Viterbi algorithm[2]. In contrast to the SBC and WPF F&D speech features, which are based on the Mel scale, the formulation of the WPSR wavelet packet features exploited the suitability of the various wavelet packet orthonormal transforms for approximating the psychoacoustic effect explained by the critical bands concept, which was introduced by Fletcher[23]. In their original design the authors used 66 filters to cover the frequency range [0, 4000] Hz. To adapt this filter-bank to the speech recognition task, it was modified to have smoothly increasing frequency resolution as follows: a resolution of 31.25 Hz for the range [0, 1000] Hz, corresponding to 32 subbands; 62.5 Hz for [1000, 2500] Hz, 24 subbands; and 125 Hz for [2500, 4000] Hz, 12 subbands. This resulted in a WP tree with a total of 92 frequency subbands, which cover the frequency range [125, 6875] Hz.
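
The per-range subband counts quoted for the modified filter-bank follow from dividing each range width by its resolution. The minimal sketch below reproduces that arithmetic and also shows which wavelet packet tree depth would give each resolution. The 4000 Hz Nyquist frequency (8 kHz sampling) is an assumption inferred from the stated [0, 4000] Hz range, and the function names are illustrative, not taken from the paper.

```python
import math

# (low_hz, high_hz, resolution_hz) for the three ranges stated in the text
BANDS = [
    (0.0, 1000.0, 31.25),
    (1000.0, 2500.0, 62.5),
    (2500.0, 4000.0, 125.0),
]

# ASSUMPTION: 8 kHz sampling, implied by the [0, 4000] Hz analysis range
NYQUIST_HZ = 4000.0


def subband_count(low_hz, high_hz, resolution_hz):
    """Number of equal-width subbands needed to tile [low_hz, high_hz]."""
    return round((high_hz - low_hz) / resolution_hz)


def wp_depth(resolution_hz, nyquist_hz=NYQUIST_HZ):
    """Depth j at which a full wavelet packet tree has bands of the given width.

    A full tree of depth j splits [0, nyquist_hz] into 2**j equal bands,
    so the band width is nyquist_hz / 2**j.
    """
    return round(math.log2(nyquist_hz / resolution_hz))


counts = [subband_count(*band) for band in BANDS]
depths = [wp_depth(band[2]) for band in BANDS]

for (low, high, res), n, j in zip(BANDS, counts, depths):
    print(f"[{low:.0f}, {high:.0f}] Hz at {res} Hz: {n} subbands, tree depth {j}")
```

Under the stated assumption, the three resolutions correspond to tree depths 7, 6, and 5, that is, the tree is pruned one level shallower each time the resolution halves.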

EXPERIMENTAL SETUP
EXPERIMENTAL RESULTS
CONCLUSION
