Abstract
We propose an adaptive Bayesian hidden Markov model for fully unsupervised part-of-speech (POS) induction. The proposed model with its inference algorithm has two extensions to the first-order Bayesian HMM with Dirichlet priors. First our algorithm infers the optimal number of hidden states from the training corpus rather than fixes the dimensionality of state space beforehand. The second extension studies the Chinese unknown word processing module which measures similarities from both morphological properties and context distribution. Experimental results showed that both of these two extensions can help to find the optimal categories for Chinese in terms of both unsupervised clustering metrics and grammar induction accuracies on the Chinese Treebank.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have