Abstract

A novel ergodic multigram hidden Markov model (HMM) is introduced that models sentence production as a doubly stochastic process: word classes are first generated by a first-order Markov model, and single- or multi-character words are then generated independently from those classes, with no word boundaries marked in the sentence. The model can therefore be applied to languages without word boundary markers, such as Chinese. Given a lexicon listing the syntactic classes of each word, its applications include language modeling for recognizers and integrated word segmentation and class tagging. No pre-segmented or tagged corpus is needed for training, and segmentation and tagging are trained jointly in a single model. This paper presents the relevant algorithms for the model and reports experimental results on a Chinese news corpus.
