Abstract

In this paper, we build statistical language models for Persian using the Peykare corpus and incorporate them into a Persian continuous speech recognition (CSR) system. First, we unify the different orthographies of words to make the corpus text consistent, and we reduce the number of POS tags used in the corpus by manual clustering. We then build word-based and class-based n-gram language models from the unified, reduced-tag-set corpus. To build the class-based models, we use several methods, including a new method called LGM-based word clustering. We also present the procedure for incorporating the language models into a Persian CSR system. Using these language models, we achieved absolute reductions of up to 13.2% in word error rate.
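To illustrate what a class-based n-gram model computes, the following is a minimal sketch of the standard class-based bigram factorization, P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i), with unsmoothed maximum-likelihood estimates. It is not the paper's implementation: the function names, the toy English sentences, and the three-class mapping are all hypothetical placeholders, and the word-to-class mapping is simply given rather than produced by POS tags or by LGM-based word clustering.

```python
from collections import Counter


def train_class_bigram(sentences, word2class):
    """Collect the counts needed for maximum-likelihood class-bigram estimates."""
    word_counts = Counter()           # count(w)
    class_counts = Counter()          # count(c)
    class_bigram_counts = Counter()   # count(c_prev, c)
    class_context_counts = Counter()  # count(c_prev) as a bigram history
    for sent in sentences:
        classes = [word2class[w] for w in sent]
        for w, c in zip(sent, classes):
            word_counts[w] += 1
            class_counts[c] += 1
        for prev, cur in zip(classes, classes[1:]):
            class_bigram_counts[(prev, cur)] += 1
            class_context_counts[prev] += 1
    return word_counts, class_counts, class_bigram_counts, class_context_counts


def class_bigram_prob(prev_word, word, word2class, model):
    """P(word | prev_word) ~= P(c | c_prev) * P(word | c), without smoothing."""
    word_counts, class_counts, bigrams, contexts = model
    c_prev, c = word2class[prev_word], word2class[word]
    if contexts[c_prev] == 0 or class_counts[c] == 0:
        return 0.0
    p_class = bigrams[(c_prev, c)] / contexts[c_prev]  # class transition
    p_word = word_counts[word] / class_counts[c]       # word emission
    return p_class * p_word


# Hypothetical toy data; not drawn from Peykare or its tag set.
sentences = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["a", "cat", "runs"],
]
word2class = {
    "the": "DET", "a": "DET",
    "cat": "N", "dog": "N",
    "sleeps": "V", "runs": "V",
}
model = train_class_bigram(sentences, word2class)
p_unseen = class_bigram_prob("a", "dog", word2class, model)
```

In this toy data the word pair "a dog" never occurs, yet `p_unseen` is nonzero because the class pair DET followed by N is observed. Sharing statistics across words in the same class is the usual motivation for class-based models when training text is sparse. A practical model would also add smoothing and sentence-boundary symbols, which this sketch omits.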

