Abstract
It is well known that automatic speech recognition (ASR) requires good spectral analysis to achieve high recognition accuracy. A wideband spectrogram seems to contain all the acoustic information needed to map any given speech signal into its corresponding sequence of phonemes. (For ASR, language models are often used to augment acoustics, but here we limit ourselves to acoustic analysis.) Various methods beyond the basic Fourier transform have found success in ASR, e.g., linear predictive analysis, wavelets, and mel-frequency cepstra (MFCC). These have all focused on extracting an efficient set of spectral parameters to facilitate phonetic discrimination. Part of the difficulty is separating spectral envelope information from excitation parameters, as variations in pitch are largely viewed as orthogonal to phoneme recognition. Another complicating factor is that amplitude and frequency scales in speech production and perception are better modeled as nonlinear (unlike the linear, fixed-bandwidth approach of Fourier transforms). Modern ASR techniques are far from optimal: the front-end data compression yielding MFCCs typically produces more than 100 parameters for a basic 80-ms phoneme, only to distinguish among approximately 32 phonemes (a 5-bit choice). We will investigate various ways to render ASR analysis more efficient. [Work supported by NSERC-Canada.]
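The parameter-count arithmetic behind that efficiency gap can be made concrete with a minimal sketch. The frame advance (10 ms) and the number of cepstral coefficients per frame (13) are conventional values assumed here for illustration; the abstract itself does not specify them.

```python
import math

# Illustrative MFCC front-end settings (assumed, not taken from the abstract):
phoneme_ms = 80         # duration of a basic phoneme, per the abstract
hop_ms = 10             # frame advance between analysis windows (common choice)
coeffs_per_frame = 13   # MFCCs retained per frame (common choice)

frames = phoneme_ms // hop_ms          # analysis frames spanning one phoneme
params = frames * coeffs_per_frame     # total front-end parameters
bits_needed = math.log2(32)            # a 1-of-32 phoneme choice

print(f"{params} MFCC parameters to make a {bits_needed:.0f}-bit decision")
```

With 8 frames of 13 coefficients, the front end emits 104 parameters, matching the "more than 100 parameters" figure; adding the common delta and delta-delta features would triple that count, widening the gap further.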