Feature extraction plays a major role in any form of pattern recognition. Current feature extraction methods used for automatic speech recognition (ASR) and speaker verification rely mainly on properties of speech production (modeled by all-pole filters) and perception (critical-band integration simulated by a Mel/Bark filter bank). We propose to use stochastic, data-driven methods to design feature extraction schemes that are trained to alleviate the unwanted variability present in speech signals. In this paper we show that such data-driven methods provide significant advantages over conventional methods, both in terms of ASR performance and in the understanding they provide about the nature of the speech signal. The first part of the paper investigates the suitability of cepstral features obtained by applying the discrete cosine transform (DCT) to logarithmic critical-band power spectra. An alternative set of basis functions was designed by linear discriminant analysis (LDA) of logarithmic critical-band power spectra, and the discriminant features extracted by these basis functions are shown to outperform cepstral features in ASR experiments. The second part of the paper discusses the relevance of the non-uniform frequency resolution used by current speech analysis methods such as Mel-frequency analysis and perceptual linear predictive analysis. It is shown that LDA of the short-time Fourier spectrum of speech yields spectral basis functions that provide comparatively lower resolution in the high-frequency region of the spectrum. This is consistent with critical-band resolution and is shown to be caused by the spectral properties of vowel sounds.
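To make the contrast between the two feature sets concrete, the following is a minimal sketch (not the authors' implementation) of how LDA-derived spectral basis functions can be obtained from labeled log critical-band power spectra, alongside the conventional DCT basis that yields cepstra. The frame labels (e.g., phoneme classes), the number of retained basis vectors, and the filter-bank dimensions are illustrative assumptions.

```python
# Sketch: data-driven (LDA) vs. conventional (DCT) spectral basis functions.
# Assumes log_spectra is an (n_frames, n_bands) array of log critical-band
# power spectra and labels is an (n_frames,) array of class labels per frame.
import numpy as np
import scipy.linalg


def lda_basis(log_spectra, labels, n_basis=13):
    """Return an (n_bands, n_basis) matrix of discriminant basis functions."""
    n_bands = log_spectra.shape[1]
    overall_mean = log_spectra.mean(axis=0)
    Sw = np.zeros((n_bands, n_bands))  # within-class scatter
    Sb = np.zeros((n_bands, n_bands))  # between-class scatter
    for c in np.unique(labels):
        X = log_spectra[labels == c]
        mu = X.mean(axis=0)
        Sw += (X - mu).T @ (X - mu)
        d = (mu - overall_mean)[:, None]
        Sb += len(X) * (d @ d.T)
    # Generalized eigenproblem Sb v = lambda Sw v; keep leading eigenvectors
    # (assumes Sw is non-singular, i.e., enough frames per class).
    eigvals, eigvecs = scipy.linalg.eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_basis]]


def dct_basis(n_bands, n_basis=13):
    """Conventional DCT-II basis applied to log critical-band spectra (cepstra)."""
    k = np.arange(n_basis)[:, None]
    n = np.arange(n_bands)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_bands).T  # (n_bands, n_basis)


# Either basis is used the same way: project each frame's log spectrum onto
# the basis to obtain the feature vector, e.g. features = log_spectra @ basis.
```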