Abstract

Word n-grams and ergodic HMMs are widely used as statistical language models trained from a large number of samples. These models can capture short-distance correlations between words, but they have difficulty expressing long-distance correlations. To address this problem, a construction algorithm is proposed for a statistical language model based on the hidden Markov network (HMnet), which can express long-distance correlations between words. To show its effectiveness, HMnet is compared with word n-grams and ergodic HMMs in simple experiments in which the training and test samples were randomly generated by a stochastic finite-state automaton. At the optimum number of states, HMnet showed lower perplexity than either word n-grams or ergodic HMMs. However, an algorithm is needed to determine this optimum number of states. If the test-set perplexity can be estimated from the training samples, the optimum number of states is the one that minimizes the estimated perplexity. An algorithm for estimating the test-set perplexity from training samples is therefore also proposed. In the experiments, the estimated values were nearly identical to the measured test-set perplexities.
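The perplexity measure used to compare the models can be sketched as follows. This is an illustrative assumption, not the paper's method: it computes test-set perplexity under a simple add-one-smoothed bigram model, whereas the paper evaluates word n-grams, ergodic HMMs, and HMnet.

```python
import math
from collections import Counter

def bigram_perplexity(train, test):
    """Test-set perplexity under an add-one (Laplace) smoothed bigram model.

    Illustrative sketch only; the paper's models (word n-grams, ergodic
    HMMs, HMnet) would each define their own sequence probability, but
    the perplexity metric itself is computed the same way.
    """
    vocab = set(train) | set(test)
    V = len(vocab)
    unigrams = Counter(train)
    bigrams = Counter(zip(train, train[1:]))

    log_prob = 0.0
    for w1, w2 in zip(test, test[1:]):
        # Laplace-smoothed conditional probability P(w2 | w1)
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
        log_prob += math.log2(p)

    n = len(test) - 1          # number of predicted words
    # Perplexity = 2^(-average log2 probability per word)
    return 2 ** (-log_prob / n)
```

In the model-selection scheme the abstract describes, one would compute (or estimate) such a perplexity for each candidate number of states and pick the minimizer.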
