Abstract

In recent years, Arabic Natural Language Processing, including Text Summarization, Text Simplification, Text Categorization and other natural-language disciplines, has been attracting more researchers. Appropriate resources for Arabic Text Categorization are becoming a necessity for the development of this research. The few existing corpora are not ready for use: they require preprocessing and filtering operations. In addition, most of them are not organized according to a standard classification method, which leads to unbalanced classes and thus reduces classification accuracy. This paper proposes a New Arabic Dataset (NADA) for Text Categorization. The corpus is composed of two existing corpora, OSAC and DAA. It is preprocessed and filtered using recent state-of-the-art methods, and it is organized using the Dewey Decimal Classification scheme and the Synthetic Minority Over-sampling Technique (SMOTE). The experimental results show that NADA is an efficient dataset, ready for use in Arabic Text Categorization.

Highlights

  • Data collection consists of gathering information to assess the outcomes and validate the research study

  • We present NADA, a New Arabic Dataset built from two existing Arabic corpora and complemented with extra classes and documents

  • This research study addresses the pressing need for Arabic corpora and the difficulty that ANLP researchers, especially in the ATC field, face in finding an appropriate corpus


Summary

INTRODUCTION

Data collection consists of gathering information to assess the outcomes and validate the research study. Access to a freely available corpus is a desirable aim. Such corpora are either not available or not designed for Arabic Text Categorization, such as the Al-Dostor newspapers [1]. Most existing Arabic corpora do not follow any technique for organizing the class hierarchy. This hierarchy helps identify the needed classes and keeps the corpus balanced, which is necessary for accurate results. Researchers in this field face a fundamental problem in comparing the results of their proposed methods with those of state-of-the-art techniques. This makes the validation step more difficult and time-consuming.
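
The abstract names the Synthetic Minority Over-sampling Technique as one of the tools used to organize the corpus, and the paragraph above stresses keeping classes balanced. The sketch below illustrates the general idea of SMOTE-style interpolation only; it is not the authors' pipeline. The function name `smote_oversample`, its parameters, and the toy two-dimensional vectors are all hypothetical, and it assumes documents have already been turned into numeric feature vectors and that the minority class holds at least two samples.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic vectors by interpolating between minority samples.

    Hypothetical helper for illustration; not taken from the NADA paper.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base sample within the minority class
        others = [v for v in minority if v is not base]
        neighbours = sorted(others, key=lambda v: math.dist(base, v))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors standing in for an under-represented class (assumed data).
minority_class = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.7], [0.3, 0.85]]
new_samples = smote_oversample(minority_class, n_new=4)
```

Each synthetic vector lies on the segment between a real minority sample and one of its nearest minority neighbours, so the new points stay inside the region the class already occupies rather than duplicating existing documents.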

ARABIC LANGUAGE
DEWEY DECIMAL CLASSIFICATION
RELATED WORKS
NADA DATASET SETUP
EXPERIMENTAL RESULTS
CONCLUSION
