Abstract
A self-learning speaker and channel adaptation technique based on the separation of speech spectral variation sources is developed for improving speaker-independent continuous speech recognition. Statistical methods are formulated to remove spectral biases at the acoustic level and to adapt parameters of Gaussian mixture densities at the phone unit level. The spectral bias is estimated in two steps using unsupervised maximum likelihood estimation: the probability distributions of the speech spectral features are first assumed uniform for severely mismatched channels, and the spectral bias is then reestimated using Gaussian phone models. Unsupervised sequential phone model adaptation (USPA) is performed via Bayesian estimation from the on-line, bias-removed speech data, and iterative phone model adaptation (IPA) is further performed for dictation applications. The task vocabulary size was 853; the grammar perplexity was 105; the test speech data were collected under mismatched recording conditions, with each test set containing 198 sentences. Depending on the speakers and channel conditions, the two-step spectral bias removal yielded relative error reductions (RER) of 3% to 11% compared to conventional cepstral mean removal; the USPA yielded RER of 12% to 26% after the two-step bias removal; the IPA further yielded RER of 8% to 19% after the USPA.
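The abstract gives the estimators only in outline, so the sketch below illustrates the kind of computation it describes, under stated assumptions: a two-step maximum likelihood bias estimate (a mean-shift initialization for the severely mismatched case, followed by EM-style re-estimation against diagonal-covariance Gaussian phone models) and a generic MAP update of the Gaussian means from bias-removed data, standing in for the Bayesian adaptation step. All function names, the use of the average model mean as the training reference, and the prior weight `tau` are hypothetical illustration choices, not taken from the paper.

```python
import numpy as np

def estimate_bias(frames, means, inv_vars, weights, n_iter=5):
    """Two-step ML estimate of a per-channel spectral bias (sketch).

    frames:   (T, D) cepstral feature vectors from the new channel
    means:    (K, D) Gaussian phone-model means (diagonal covariances)
    inv_vars: (K, D) inverse variances of the K Gaussians
    weights:  (K,)   mixture weights
    """
    # Step 1: under the uniform-distribution assumption for a severely
    # mismatched channel, take the bias as the shift of the channel mean
    # from a training reference (here, the average of the model means).
    bias = frames.mean(axis=0) - means.mean(axis=0)

    # Step 2: re-estimate the bias with the Gaussian phone models via
    # EM-style iterations over component posteriors.
    for _ in range(n_iter):
        x = frames - bias                                   # bias-removed
        diff = x[:, None, :] - means[None, :, :]            # (T, K, D)
        # log-likelihood of each frame under each Gaussian (up to a constant)
        logp = -0.5 * np.sum(diff ** 2 * inv_vars, axis=2)
        logp += 0.5 * np.sum(np.log(inv_vars), axis=1) + np.log(weights)
        gamma = np.exp(logp - logp.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)           # posteriors (T, K)
        # ML bias given posteriors: precision-weighted average of the
        # residuals (frame - component mean), component by component.
        resid = frames[:, None, :] - means[None, :, :]      # (T, K, D)
        num = np.einsum('tk,kd,tkd->d', gamma, inv_vars, resid)
        den = gamma.sum(axis=0) @ inv_vars                  # (D,)
        bias = num / den
    return bias

def map_adapt_means(frames, means, gamma, tau=10.0):
    """MAP (Bayesian) update of Gaussian means from bias-removed frames,
    in the spirit of the unsupervised sequential adaptation step.

    gamma: (T, K) component posteriors computed from the same frames.
    tau:   hypothetical prior weight balancing old means vs. new data.
    """
    occ = gamma.sum(axis=0)                                 # (K,) occupancies
    acc = gamma.T @ frames                                  # (K, D) data sums
    return (tau * means + acc) / (tau + occ)[:, None]
```

The MAP form makes the sequential behavior concrete: with little adapted data the updated means stay near the speaker-independent priors, and they move toward the data means as occupancy accumulates, which is consistent with the on-line adaptation the abstract describes.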