One of the major problems in speech recognition is the inability of trained models to generalize appropriately to channel variations, new speakers, or modified acoustics. A naive observer might believe that a multimillion-parameter system should be sufficient; the difficulty, however, appears to be too many parameters rather than too few. For moderate-sized training corpora, systems learn the particular conditions of the training data rather than generalizing from the exemplars. (For instance, speech recognition algorithms will generally score speech from a training speaker higher than speech from a speaker who was excluded from the training set.) One can force the issue by explicitly modeling systematic variation and then "normalizing" at the front end or in the acoustic model. Two exemplars of this philosophy are cepstral mean subtraction and vocal tract normalization [Frontiers in Speech Processing '94, LDC96S40, Linguistic Data Consortium (1995)]. In each case a single parameter of a very restrictive model is estimated, and accounting for the variability explicitly improves performance. Concrete examples of these situations are offered, and the implications of this work for future research in automatic speech recognition are discussed.
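As a concrete illustration of the front-end normalization the abstract describes, the following is a minimal sketch of cepstral mean subtraction, assuming features arrive as a NumPy matrix of shape (frames, coefficients); the function name and the synthetic data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean of each cepstral coefficient.

    A stationary convolutional channel adds a constant offset in the
    cepstral domain, so removing the time average of each coefficient
    cancels that channel term (the single estimated parameter per
    coefficient mentioned in the abstract's "restrictive model" view).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Illustrative usage with random values standing in for real MFCCs.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 13))    # 200 frames, 13 coefficients
    channel = rng.normal(size=(1, 13))    # fixed channel offset per coefficient
    normalized = cepstral_mean_subtraction(feats + channel)
    print(normalized.mean(axis=0))        # ~0 for every coefficient
```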