NADA: New Arabic Dataset for Text Classification

Nada Alalyani, Souad Larabi

doi:10.14569/ijacsa.2018.090928

Abstract

In recent years, Arabic Natural Language Processing, including text summarization, text simplification, text categorization, and other natural-language-related disciplines, has been attracting more researchers. Appropriate resources for Arabic Text Categorization are becoming a necessity for the development of this research. The few existing corpora are not ready for use: they require preprocessing and filtering operations. In addition, most of them are not organized according to standard classification methods, which leads to unbalanced classes and thus reduces classification accuracy. This paper proposes a New Arabic Dataset (NADA) for text categorization. The corpus is composed of two existing corpora, OSAC and DAA. The new corpus is preprocessed and filtered using recent state-of-the-art methods. It is also organized based on the Dewey Decimal Classification scheme and the Synthetic Minority Over-sampling Technique (SMOTE). The experimental results show that NADA is an efficient dataset ready for use in Arabic text categorization.
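To illustrate the idea behind the Synthetic Minority Over-sampling Technique mentioned in the abstract, here is a minimal sketch in plain Python. It is not the authors' pipeline: the function name `smote`, its parameters, and the toy feature vectors are assumptions made for illustration only. The sketch creates synthetic samples for an under-represented class by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points for a minority class.

    Hypothetical helper: each synthetic point lies on the segment between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours. Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy two-dimensional feature vectors (invented data, e.g. term weights)
minority = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
new_points = smote(minority, n_new=4)
```

In practice a maintained implementation, such as the one in the imbalanced-learn library, would normally be preferred over a hand-written version like this.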

Highlights

Data collection consists of gathering information to assess the outcomes and validate the research study
We present NADA, a New Arabic Dataset built from two existing Arabic corpora and complemented with extra classes and documents
This study addresses the acute shortage of Arabic corpora and the difficulty that Arabic Natural Language Processing (ANLP) researchers, especially those working on Arabic Text Categorization (ATC), face in finding an appropriate corpus

Summary

INTRODUCTION

Data collection consists of gathering information to assess the outcomes and validate the research study. Access to a freely available corpus is a desirable aim. However, such corpora are either not available or not designed for Arabic Text Categorization, as in the case of the Al-Dostor newspapers [1]. Most existing Arabic corpora do not follow any technique for organizing the class hierarchy. Such a hierarchy helps identify the needed classes and keeps the corpus balanced, which supports accurate results. Researchers in this field therefore face a fundamental problem in comparing the results of their proposed methods with those of state-of-the-art techniques, which makes the validation step more difficult and time-consuming.

ARABIC LANGUAGE

DEWEY DECIMAL CLASSIFICATION

RELATED WORKS

NADA DATASET SETUP

EXPERIMENTAL RESULTS

CONCLUSION

Journal: International Journal of Advanced Computer Science and Applications	Publication Date: Jan 1, 2018
Citations: 15	License type: cc-by

Similar Papers

Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms
Bassam Al-Salemi ... Shahrul Azman Mohd Noah
Information Processing & Management | VOL. 56
22 Oct 2018

Arabic Text Classification Using Convolutional Neural Network and Genetic Algorithms
Deem Alsaleh ... Souad Larabi-Marie-Sainte
IEEE Access | VOL. 9
01 Jan 2020

Arabic Idioms Detection by Utilizing Deep Learning and Transformer-based Models
Hanen Himdi
Procedia Computer Science | VOL. 244
01 Jan 2024

Neural networks for the automation of Arabic text categorization
Saleh M Alsaleem
01 Jan 2013
