KurdSum: A new benchmark dataset for the Kurdish text summarization

Soran Badawi

doi:10.1016/j.nlp.2023.100043

Abstract

Text summarization condenses a document while preserving its essential information. With the abundance of digital information now available, summarization has become an important task in fields such as information retrieval, natural language processing (NLP), and machine learning. The task has been studied extensively for languages such as English and Chinese, but research on Kurdish summarization is lacking. We therefore present KurdSum, the first Kurdish news summarization dataset, which contains over 40,000 texts. We collected news articles from Kurdish websites, preprocessed the data, and manually wrote a summary for each article. We then benchmarked four extractive systems (LEXRANK, TEXTRANK, ORACLE, and LEAD-3) and three abstractive methods (Pointer-Generator, Sequence-to-Sequence, and transformer-abstractive) on the dataset. In our experiments, the Pointer-Generator approach achieved higher ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores than the other techniques, and ORACLE outperformed the other extractive methods. These findings offer a promising direction for Kurdish text summarization and can contribute to the development of NLP tools for the Kurdish language. The dataset can also serve as a benchmark for Kurdish summarization and as a resource for researchers developing Kurdish summarization models.
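To make the evaluation setup concrete, the following is a minimal sketch of a lead-sentence extractive baseline (the first three sentences of an article) scored with unigram ROUGE. It is not the authors' implementation: the function names `lead_n` and `rouge_1`, the punctuation-based sentence splitter, and the whitespace tokenizer are all assumptions made for illustration, and a real Kurdish pipeline would likely need language-specific normalization and tokenization.

```python
import re
from collections import Counter


def lead_n(article: str, n: int = 3) -> str:
    """Return the first n sentences of an article (hypothetical helper).

    Sentences are split naively after '.', '!', '?' or the Arabic
    question mark; this is an assumption, not the paper's preprocessing.
    """
    parts = re.split(r"(?<=[.!?؟])\s+", article)
    sentences = [s.strip() for s in parts if s.strip()]
    return " ".join(sentences[:n])


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 (hypothetical helper).

    Tokens are lowercased whitespace-separated strings, so punctuation
    stays attached to words; no stemming is applied.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    article = "First sentence. Second sentence. Third sentence. Fourth sentence."
    reference = "First sentence. Third sentence."
    summary = lead_n(article, n=3)
    print(summary)
    print(rouge_1(summary, reference))
```

Averaging such scores over every article and reference summary pair in a test split gives a corpus-level figure for the baseline; published results typically also report ROUGE-2 and ROUGE-L.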

Full Text