Abstract

This paper examines two approaches based on vocal tract length normalization (VTLN) for handling the acoustic mismatch across speakers in automatic speech recognition, in the special case where training data is available from only a small number of speakers. The first is the conventional VTLN approach, in which both training and test utterances are frequency-warped according to a maximum likelihood (ML) warping-factor estimation scheme in order to normalize speaker characteristics. The second builds a virtually speaker-independent (SI) acoustic model from artificial multi-speaker data, generated by applying VTLN-based frequency warping to the training utterances of the limited speaker set. To compare the two approaches, Korean isolated-word recognition experiments are performed with a small amount of training data from a limited number of speakers. The results show that the virtually SI acoustic model outperforms both the conventional VTLN approach and the baseline system when the training speakers are very limited.
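As a rough illustration of the two approaches described above, the following is a minimal Python sketch, assuming a piecewise-linear warping function and a grid of candidate warping factors; the extract_features and log_likelihood hooks are hypothetical placeholders standing in for the feature front end and acoustic-model scorer, and none of the names or constants are taken from the paper:

```python
import numpy as np

# Candidate warping factors; a grid around 1.0 is a common VTLN choice.
ALPHAS = np.arange(0.88, 1.13, 0.02)

def warp_frequency(f, alpha, f_max=8000.0, f_break_ratio=0.875):
    """Piecewise-linear warp: scale frequencies by alpha below a break
    frequency, then interpolate so f_max maps onto itself (bandwidth
    is preserved)."""
    f_break = f_break_ratio * f_max
    if f <= f_break:
        return alpha * f
    # Linear segment joining (f_break, alpha*f_break) to (f_max, f_max).
    slope = (f_max - alpha * f_break) / (f_max - f_break)
    return alpha * f_break + slope * (f - f_break)

def estimate_warp_factor(utterance, extract_features, log_likelihood):
    """Conventional VTLN: grid-search the warping factor that maximizes
    the acoustic-model likelihood of the warped features (ML estimation)."""
    best_alpha, best_score = 1.0, -np.inf
    for alpha in ALPHAS:
        feats = extract_features(utterance, alpha)  # warped front end
        score = log_likelihood(feats)               # acoustic-model score
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha

def make_virtual_speakers(utterances, extract_features):
    """Virtually-SI training data: warp each utterance with every factor
    in the grid, treating each warped copy as if it came from a new
    'speaker', to enlarge the effective speaker population."""
    return [extract_features(u, a) for u in utterances for a in ALPHAS]
```

In this sketch the same warping machinery serves both approaches: at test time estimate_warp_factor normalizes each utterance toward the model, while at training time make_virtual_speakers expands the limited-speaker corpus before the SI model is trained.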
