Abstract

In this paper we present a synthesis of the theoretical fundamentals and some practical aspects of statistical (n-gram) language modeling, a core component of large-vocabulary statistical speech recognition systems. We describe the unigram, bigram, and trigram language models, as well as the Katz back-off smoothing algorithm based on the Good-Turing estimator, and the perplexity measure used to evaluate a language model. The practical experiments were carried out on a corpus built from the Romanian Constitution; we also present the text normalization steps applied before language model generation. The results are ARPA-MIT format language models for the Romanian language, which were tested and compared using the perplexity measure. Finally, we compare language modeling for Romanian and English and draw conclusions.
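For orientation, the techniques named above admit standard formulations, sketched below in common textbook notation; these are conventional definitions, not reproduced from the paper's own derivations, and the paper's notation may differ.

```latex
% Good-Turing adjusted count for an n-gram observed c times,
% where N_c is the number of distinct n-grams occurring exactly c times:
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}}

% Katz back-off estimate, shown here for a bigram; d_c = c^{*}/c is the
% Good-Turing discount ratio and \alpha(w_{i-1}) redistributes the
% discounted probability mass over unseen continuations:
P_{\mathrm{Katz}}(w_i \mid w_{i-1}) =
  \begin{cases}
    d_c \, \dfrac{C(w_{i-1} w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0,\\[6pt]
    \alpha(w_{i-1}) \, P(w_i)                 & \text{otherwise.}
  \end{cases}

% Perplexity of a model over a held-out sequence w_1 \dots w_N
% (lower perplexity indicates a better-fitting model):
PP(W) = P(w_1 w_2 \dots w_N)^{-1/N}
```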
