Abstract

A typical large-vocabulary automatic speech recognition (ASR) system consists of three main components: (1) feature extraction, (2) pattern classification, and (3) language modeling. Replacing hardwired prior knowledge in the pattern classification and language modeling modules with knowledge derived from data has turned out to be one of the most significant advances in ASR research in the past two decades. However, the speech analysis module has so far resisted this data-oriented revolution and is typically built on textbook knowledge of speech production and perception. Since it is believed that speech was optimized by millennia of human evolution to fit the properties of human speech perception, deriving speech processing knowledge from speech data itself may make sense. This work describes some attempts in this direction. Linear discriminant analysis is used to learn about the structure of the speech signal and to derive optimized spectral basis functions and filters (replacing the conventional cosines of cepstral analysis and the conventional delta filters for deriving dynamic features) in processing the time-frequency plane of the speech signal. [Work supported by the Department of Defense and by the National Science Foundation.]
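The abstract gives no implementation details, but the core idea of replacing the fixed cosine basis of cepstral analysis with data-derived discriminant directions can be sketched with plain NumPy. The following is a minimal illustration, not the authors' method: the function name `lda_basis`, the regularization term, and the data shapes are assumptions. It performs classical multi-class LDA on labeled log-spectral frames and returns the leading discriminant vectors, which play the role of spectral basis functions.

```python
import numpy as np

def lda_basis(frames, labels, n_basis=13):
    """Derive discriminant spectral basis functions via LDA.

    frames: (n_frames, n_bands) log-spectral vectors
    labels: (n_frames,) phoneme-class labels
    Returns an (n_bands, n_basis) matrix whose columns are the
    leading discriminant directions, analogous to the cosine
    basis of conventional cepstral analysis.
    """
    classes = np.unique(labels)
    n_bands = frames.shape[1]
    mean_all = frames.mean(axis=0)
    Sw = np.zeros((n_bands, n_bands))  # within-class scatter
    Sb = np.zeros((n_bands, n_bands))  # between-class scatter
    for c in classes:
        fc = frames[labels == c]
        mc = fc.mean(axis=0)
        d = fc - mc
        Sw += d.T @ d
        diff = (mc - mean_all)[:, None]
        Sb += fc.shape[0] * (diff @ diff.T)
    # Solve the generalized eigenproblem Sb v = lambda Sw v;
    # a small ridge term keeps Sw invertible (an assumption here).
    evals, evecs = np.linalg.eig(
        np.linalg.solve(Sw + 1e-6 * np.eye(n_bands), Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:n_basis]]
```

The same machinery, applied to stacked temporal trajectories of a single spectral band rather than to spectral vectors, would yield data-derived temporal filters in place of the conventional delta filters mentioned in the abstract.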
