Unsupervised determination of efficient Korean LVCSR units using a Bayesian Dirichlet process model

Sakriani Sakti, Hisashi Kawai, Andrew Finch, Satoshi Nakamura, Ryosuke Isotani

doi:10.1109/icassp.2011.5947395

Abstract

Korean is an agglutinative language that does not have explicit word boundaries. It is also a highly inflective language that exhibits severe coarticulation effects. These characteristics pose a challenge in developing large-vocabulary continuous speech recognition (LVCSR) systems. Many existing Korean LVCSR systems attempt to overcome these difficulties by defining a set of “word” units using rule-based morphological analysis or statistical methods. These approaches usually require a great deal of linguistic knowledge or at least some explicit information about the statistical distribution of the units. Even so, exceptions and uncommon words (e.g., foreign proper nouns) remain that rules alone cannot cover. In this paper, we investigate the use of an unsupervised, nonparametric Bayesian approach to automatically determining efficient units for a Korean LVCSR system. Specifically, we utilize a Dirichlet process model trained using Bayesian inference through block Gibbs sampling. Our approach provides a principled way of learning units without explicit linguistic knowledge or any static parameters. Experiments were conducted on a travel domain corpus, which includes many foreign words and proper nouns. We compared our method to a set of state-of-the-art baseline systems that relied on either morphological analysis or segmentation heuristics. Our system produced a considerably more compact set of “word” units than the best baseline system (the lexical dictionary was approximately half the size) and reduced the word error rate by 5.89% relative to that system.
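As a rough illustration of how a Dirichlet process model can favor a compact set of units, the sketch below computes the standard Chinese-restaurant-process predictive probability of a candidate unit: units that are already frequent are reused, while unseen units fall back on a base measure. The abstract does not specify the model's exact form, so the base measure `base_prob`, the concentration parameter `alpha`, the symbol inventory size `n_symbols`, and the toy counts are all assumptions made here for illustration, and Latin letters stand in for Korean sub-word symbols.

```python
from collections import Counter


def base_prob(unit, n_symbols, p_stop=0.5):
    """Hypothetical base measure P0: geometric length, uniform symbols."""
    length = len(unit)
    length_prob = p_stop * (1.0 - p_stop) ** (length - 1)
    return length_prob * (1.0 / n_symbols) ** length


def dp_predictive(unit, counts, total, alpha, n_symbols):
    """Predictive probability of `unit` given the units generated so far.

    counts: how often each unit appears in the current segmentation
    total:  total number of unit tokens in the current segmentation
    alpha:  concentration parameter (assumed value, not from the paper)
    """
    return (counts[unit] + alpha * base_prob(unit, n_symbols)) / (total + alpha)


# Toy segmentation state (made-up counts, for illustration only).
counts = Counter({"ab": 3, "c": 1})
total = sum(counts.values())

p_seen = dp_predictive("ab", counts, total, alpha=1.0, n_symbols=3)
p_new = dp_predictive("ba", counts, total, alpha=1.0, n_symbols=3)
print(p_seen > p_new)  # a previously seen unit is preferred over a new one
```

A Gibbs sampler for segmentation would repeatedly resample unit boundaries using probabilities of this kind; a block sampler, as named in the abstract, would resample the boundaries of a whole utterance at once. Neither sampler is implemented in this sketch.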

Full Text