AraDS: Arabic Datasets for Text Mining Approaches

Wael M.S Yafooz, Abdel-Hamid M Emara, Abdullah Alsaeedi

doi:10.1109/icsca57840.2023.10087675

https://doi.org/10.1109/icsca57840.2023.10087675

Abstract

With the digital transformation, daily life has come to depend on the digital world. As a result, a massive amount of unstructured textual data is being produced and accumulated at an increasing rate. Such data is used in many applications, such as sentiment analysis, topic modeling, summarization, classification, and clustering. However, researchers waste time and effort collecting data and constructing datasets to examine and evaluate their models or algorithms. For English, many benchmark datasets exist, whereas the Arabic language lacks datasets in many domains. This paper introduces an Arabic dataset collection called AraDS. AraDS consists of three Arabic datasets, namely the Arabic Dataset on Herbal Treatments for Diabetes (ADHTD), the Arabic Multi-Classification Dataset (AMCD), and the Arabic Sentiment Dataset - Khat (ASDK). These datasets were collected from social media platforms such as YouTube and contain user-generated comments and video metadata. AraDS is publicly available. The annotation process was assessed using four methods: Cohen's kappa, three native Arabic-speaking annotators, accuracy, and F-measure. The AMCD and ASDK datasets are balanced, whereas ADHTD is imbalanced. Two of the datasets have binary classes, and the third is a multi-class dataset. These datasets can benefit researchers in data mining, text mining, and sentiment analysis who need data on which to apply their methods and algorithms.
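
As an illustration of the agreement measure named in the abstract, the following is a minimal sketch of how Cohen's kappa could be computed for two annotators. The function name `cohens_kappa` and the toy labels are hypothetical and are not taken from AraDS; the paper's own evaluation procedure may differ.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: derived from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels, not drawn from the AraDS datasets.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance. With three annotators, as described in the abstract, one option is to compute kappa for each pair of annotators.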

Full Text