Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus

Hadi Veisi, Hawre Hosseini, Mohammad Mohammadamini

doi:10.1093/llc/fqy074

Abstract

In this article, we introduce the AsoSoft text corpus, the first text corpus for the Central Kurdish (Sorani) branch of Kurdish. The Kurdish language, which is spoken by more than 30 million people, has various dialects. As one of the two main branches of Kurdish, Central Kurdish is the formal dialect of Kurdish literature. The AsoSoft text corpus contains 188 million tokens and has been collected mostly from websites, published books, and magazines. The corpus has been normalized and converted into Text Encoding Initiative (TEI) XML format. We faced several challenges in both collecting and processing the text, and we propose solutions to them. About 22% of the corpus is annotated with one of six topic tags, and a topic identification task has been carried out to evaluate the correctness of the annotation. We also present computational experiments on Central Kurdish text processing, supported by related supplementary statistics. For the first time, we examine the validity of Zipf's law for Central Kurdish and calculate the perplexity of the language using standard N-gram language models. The perplexity of Central Kurdish is 276 for a trigram language model.
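
As a rough illustration of the two statistics mentioned in the abstract, the following Python sketch estimates a Zipf exponent from a token list and computes perplexity from per-token model probabilities. It is not the authors' pipeline: the function names zipf_slope and perplexity are hypothetical, the slope is fitted by plain least squares on the log-log rank-frequency curve, and the per-token probabilities are assumed to come from some already trained N-gram model.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Under Zipf's law, frequency is roughly proportional to 1 / rank,
    so the fitted slope should be close to -1.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct token types")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over N test tokens.

    token_probs holds the probability the language model assigned to
    each test token given its history (assumed strictly positive,
    i.e. the model is smoothed).
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


if __name__ == "__main__":
    # Synthetic tokens whose counts follow 120 / rank exactly.
    toy = []
    for rank in range(1, 6):
        toy.extend([f"w{rank}"] * (120 // rank))
    print(zipf_slope(toy))
    # A model that assigns probability 0.25 to every test token.
    print(perplexity([0.25] * 4))
```

Read this way, a trigram perplexity of 276 means that the model is, on average, about as uncertain as if it were choosing uniformly among 276 words at each position.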

Full Text