Building Statistical Language Models for Persian Continuous Speech Recognition Systems Using the Peykare Corpus

Mohammad Bahrani, Hossein Sameti

doi:10.1142/s1793840611002188

Abstract

In this paper, we build statistical language models for Persian using the Peykare corpus and incorporate them into a Persian continuous speech recognition (CSR) system. First, we unify the different orthographies of words to make the corpus text consistent. We also reduce the number of part-of-speech (POS) tags used in the corpus by manual clustering. Word-based and class-based n-gram language models are then built from the unified, reduced-tag-set corpus. To build the class-based models, we use several methods, including a new one called LGM-based word clustering. We present the procedure for incorporating the language models into a Persian CSR system. Using these language models, absolute reductions of up to 13.2% in word error rate were achieved.
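
For readers unfamiliar with class-based n-gram models, the following is a minimal sketch of the textbook class-based bigram with hard word-to-class assignment, P(w_i | w_{i-1}) = P(w_i | c_i) P(c_i | c_{i-1}). It is not taken from the paper: the function name `train_class_bigram`, the toy English sentences, and the tag names are hypothetical stand-ins for a POS-tagged corpus, and the estimates are unsmoothed maximum-likelihood counts.

```python
from collections import Counter

def train_class_bigram(sentences, word2class):
    """Return a function computing P(word | prev_word) under a class-based bigram.

    Assumes every word maps to exactly one class (hard clustering) and uses
    unsmoothed maximum-likelihood estimates.
    """
    word_counts = Counter()      # count(w)
    class_counts = Counter()     # count(c)
    class_bigrams = Counter()    # count(c_prev, c)
    history_counts = Counter()   # count(c_prev) as a bigram history

    for sent in sentences:
        # "<s>" is a hypothetical sentence-start class.
        classes = ["<s>"] + [word2class[w] for w in sent]
        for w, c in zip(sent, classes[1:]):
            word_counts[w] += 1
            class_counts[c] += 1
        for prev_c, c in zip(classes, classes[1:]):
            class_bigrams[(prev_c, c)] += 1
            history_counts[prev_c] += 1

    def prob(word, prev_word=None):
        c = word2class[word]
        prev_c = "<s>" if prev_word is None else word2class[prev_word]
        # Unseen classes or histories get probability 0 (no smoothing here).
        if class_counts[c] == 0 or history_counts[prev_c] == 0:
            return 0.0
        p_word_given_class = word_counts[word] / class_counts[c]
        p_class_given_prev = class_bigrams[(prev_c, c)] / history_counts[prev_c]
        return p_word_given_class * p_class_given_prev

    return prob

# Toy data (illustrative only, not from the Peykare corpus).
word2class = {"the": "DET", "a": "DET", "cat": "N", "dog": "N",
              "sleeps": "V", "runs": "V"}
sentences = [["the", "cat", "sleeps"],
             ["a", "dog", "runs"],
             ["the", "dog", "sleeps"]]

prob = train_class_bigram(sentences, word2class)
p_a_cat = prob("cat", "a")   # the word pair "a cat" never occurs in the toy data
```

In this toy setup the word bigram "a cat" is never observed, yet the class-based model still assigns it a nonzero probability because the class sequence DET followed by N is observed. This sharing of statistics across words in the same class is the usual motivation for class-based models on sparse corpora.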
