Building a Test Collection for Sorani Kurdish

Kyumars Sheykh Esmaili, Asrin Mohammadi, Purya Aliabadi, Shownem Hakimi, Shahin Salavati, Somayeh Yosefi, Donya Eliassi

doi:10.1109/aiccsa.2013.6616470

Abstract

Despite having a large number of speakers, Sorani, one of the two principal branches of the Kurdish language, is among the less-resourced languages. This paper reports on the outcomes of a project aimed at providing the essential resources for processing Sorani texts. The primary output of this project is Pewan, the first standard test collection for evaluating Sorani Information Retrieval (IR) systems. The other language resources that we have constructed in this project are: (i) a light stemmer, (ii) a list of affixes, and (iii) a list of stopwords. We also used these newly built resources to study the effectiveness of basic IR strategies on Sorani documents. Our experimental results show that normalization and, to a lesser extent, stemming can greatly improve the performance of Sorani IR systems.

Full Text