Abstract

Ensemble acoustic modeling can be used to model the different factors that cause variability in the acoustic space, and the resulting models can be combined in different ways to improve the performance of automatic speech recognition (ASR). One of the main concerns is how to partition the training data set into the subsets on which the ensemble models are trained. In this study, we focus on ensemble acoustic modeling of the acoustic variability caused by gender and accent in Chinese large vocabulary continuous speech recognition (LVCSR). Considering that gender and accent information may be encoded in the local acoustic realizations of a few specific phonetic classes rather than in a global acoustic distribution, we propose an acoustic space partition method for ensemble acoustic modeling that is based on broad phonetic class (BPC) modeling of speakers. Using principal component analysis (PCA) of the BPC-based speaker representation, we design two-level hierarchical data partitions in a low-dimensional speaker factor space that captures gender and accent information. Ensemble acoustic models were trained on the partitioned data sets at both levels. Speech recognition results showed that acoustic models trained on the first-level and second-level partitions achieved relative character error rate reductions of 9.73% and 32.29%, respectively.
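
To make the partitioning idea concrete, the following is a minimal sketch of projecting speaker vectors into a low-dimensional space with PCA and deriving a two-level hierarchical partition there. It is an illustration under stated assumptions and not the method of this study: the speaker vectors are synthetic stand-ins for BPC-based representations, and the splitting rules (sign of the first component, then the within-group median of the second component) as well as the names `pca_project` and `two_level_partition` are hypothetical choices made only for this example.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def two_level_partition(Z):
    """Hypothetical rule: split on the sign of PC1, then split each
    resulting group at its own median of PC2."""
    level1 = (Z[:, 0] > 0).astype(int)
    level2 = np.zeros(len(Z), dtype=int)
    for g in (0, 1):
        idx = np.where(level1 == g)[0]
        if idx.size == 0:
            continue
        med = np.median(Z[idx, 1])
        # Second-level labels 2g and 2g+1 nest inside first-level group g
        level2[idx] = 2 * g + (Z[idx, 1] > med).astype(int)
    return level1, level2


# Synthetic stand-in for per-speaker vectors (40 speakers, 12 dimensions).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))
X[:20, 0] += 6.0  # a coarse factor separating two broad speaker groups
X[::2, 1] += 3.0  # a finer factor varying within each group

Z = pca_project(X, n_components=2)
level1, level2 = two_level_partition(Z)

print("first-level subset sizes: ", np.bincount(level1, minlength=2))
print("second-level subset sizes:", np.bincount(level2, minlength=4))
```

In an ensemble setup, one acoustic model would then be trained per subset at each level. A clustering method could replace the simple threshold rules used here.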
