Abstract

Cepstral Mean and Variance Normalization (CMVN) is a computationally efficient normalization technique for noise-robust speech recognition. The performance of CMVN is known to degrade for short utterances, for two reasons: there is too little data to estimate the parameters reliably, and discriminable information is lost because every utterance is forced to have zero mean and unit variance. In this work, we propose using posterior estimates of the mean and variance in CMVN instead of the maximum likelihood estimates. This Bayesian approach, in addition to providing robust parameter estimates, is also shown to preserve discriminable information without increasing computational cost, making it particularly relevant for Interactive Voice Response (IVR)-based applications. The relative word error rate (WER) reductions of this approach with respect to Cepstral Mean Normalization, CMVN and Histogram Equalization are, respectively, (i) 40.1%, 27% and 4.3% on the Aurora2 database for all utterances, (ii) 25.7%, 38.6% and 30.4% on the Aurora2 database for short utterances, and (iii) 18.7%, 12.6% and 2.5% on the Aurora4 database.
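
To make the contrast between the two kinds of estimate concrete, the following is a minimal sketch, not the paper's actual formulation. The abstract does not state the prior or the exact posterior formulas, so the sketch assumes a simple shrinkage form in which the per-utterance estimates are interpolated with prior values according to the number of frames. The function names, the prior statistics `prior_means` and `prior_vars` (imagined here as estimated from training data), and the prior weight `tau` are all hypothetical.

```python
import math
from statistics import fmean, pvariance


def cmvn_ml(frames, eps=1e-8):
    """Conventional CMVN: normalize each cepstral dimension with the
    per-utterance maximum likelihood mean and variance."""
    cols = list(zip(*frames))  # one tuple of values per cepstral dimension
    means = [fmean(c) for c in cols]
    variances = [pvariance(c, m) for c, m in zip(cols, means)]
    return [
        [(x - m) / math.sqrt(v + eps) for x, m, v in zip(frame, means, variances)]
        for frame in frames
    ]


def cmvn_shrunk(frames, prior_means, prior_vars, tau=50.0, eps=1e-8):
    """CMVN with shrinkage estimates (an ASSUMED form, not taken from the paper).

    The utterance statistics are pulled toward the prior statistics.
    tau is a hypothetical prior weight expressed in frames: short
    utterances lean on the prior, long ones approach the ML estimates.
    """
    n = len(frames)
    w = n / (n + tau)  # weight given to the utterance's own statistics
    cols = list(zip(*frames))
    ml_means = [fmean(c) for c in cols]
    ml_vars = [pvariance(c, m) for c, m in zip(cols, ml_means)]
    means = [w * m + (1.0 - w) * p for m, p in zip(ml_means, prior_means)]
    variances = [w * v + (1.0 - w) * p for v, p in zip(ml_vars, prior_vars)]
    return [
        [(x - m) / math.sqrt(v + eps) for x, m, v in zip(frame, means, variances)]
        for frame in frames
    ]
```

Under these assumptions the normalized features of a short utterance are no longer forced to exactly zero mean and unit variance, which is one way to picture how utterance-specific information can survive normalization. The cost stays at one pass over the utterance plus a per-dimension interpolation.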
