Abstract

We describe the use of spectral transformation to perform speaker adaptation for HMM-based isolated-word speech recognition. The paper describes and compares three methods, namely minimum mean square error (MMSE), canonical correlation analysis (CCA) and multi-layer perceptrons (MLP), for computing the transformations. Using isolated words from the TI-46 speech corpus, we found that CCA offers the best adaptation performance. Three HMM training and adaptation strategies are also discussed. In the “no-retraining” approach, the spectral transformation is computed from a small amount of adaptation data and may be used, essentially, for on-line adaptation. The “training-after-adaptation” approach computes transformations prior to off-line HMM training, but produces a better set of models. The third approach is a novel two-stage combination of the two, and has been found to achieve good adaptation performance while maintaining fast adaptation. Our experiments show that when the CCA estimation method is used with this two-stage approach, on average only around 10% of a new speaker's training data is required for adaptation to achieve better recognition accuracy than that obtained using that speaker's own speaker-dependent models.
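To make the CCA-based estimation concrete, the following is a minimal sketch (not the paper's exact formulation) of fitting a linear spectral mapping between time-aligned feature frames of a new speaker and a reference speaker via canonical correlation analysis. The function name `cca_transform`, the regularizer `eps`, and the assumption of equal-dimensional, frame-aligned feature matrices are all illustrative choices, not taken from the paper.

```python
import numpy as np

def cca_transform(X, Y, eps=1e-8):
    """Estimate a linear spectral mapping from new-speaker frames X to
    reference frames Y via CCA (illustrative sketch, not the paper's exact method).
    X, Y: (n_frames, n_dims) time-aligned feature matrices, same n_dims.
    Returns (W, mx, my) such that Y is approximated by (X - mx) @ W + my."""
    mx, my = X.mean(0), Y.mean(0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Sxx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + eps * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    # Whiten each space, then SVD the whitened cross-covariance:
    # singular values are the canonical correlations.
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U      # X-side canonical directions
    B = Wy @ Vt.T   # Y-side canonical directions
    # Map X into canonical space, scale by the correlations, and map
    # back into Y's feature space.
    W = A @ np.diag(s) @ np.linalg.pinv(B)
    return W, mx, my
```

Applying the returned transform to every incoming frame of the new speaker's speech then lets the existing reference-speaker HMMs be used without retraining, which is the essence of the no-retraining strategy described above.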
