Abstract

Categorizing Arabic text documents is an important research topic in Natural Language Processing (NLP) and Machine Learning (ML). The number of Arabic documents grows rapidly every day as new web pages, news articles, and social media content are added. Hence, classifying such documents into specific classes is of high importance to many people and applications. Convolutional Neural Networks (CNNs) are a class of deep learning models that have been shown to be useful for many NLP tasks, including text translation and text categorization for the English language. Word embedding represents text terms as real-valued vectors in a vector space that capture both syntactic and semantic traits of text. Most current studies on classifying Arabic text documents use traditional text representations such as bag-of-words and TF-IDF weighting; few use word embedding. Traditional ML algorithms have already been applied to Arabic text categorization with good results. In this study, we present a Multi-Kernel CNN model for classifying Arabic news documents enriched with n-gram word embedding, which we call the Superior Arabic Text Categorization Deep Model (SATCDM). Evaluated on 15 freely available datasets, the proposed solution achieves very high accuracy compared to current research in Arabic text categorization. The model achieves an accuracy ranging from 97.58% to 99.90%, which is superior to similar studies on the Arabic document classification task.
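To make the multi-kernel idea concrete, the following is a minimal sketch in plain Python, not the SATCDM architecture itself, whose layers and hyperparameters are not given here. It assumes a document is already a sequence of embedding vectors and applies kernels of several widths, each followed by ReLU and max-over-time pooling. The function names, the kernel widths (2, 3, 4), and the random weights are illustrative assumptions.

```python
import random

def conv_max_pool(embedded, kernel):
    """Slide one kernel over the token sequence; return the pooled response."""
    width = len(kernel)  # number of consecutive tokens the kernel spans
    if len(embedded) < width:
        return 0.0  # document shorter than the kernel: no valid window
    best = 0.0  # starting at 0.0 acts as a ReLU before max-over-time pooling
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        # Dot product between the kernel and this window of embeddings.
        score = sum(
            k * x
            for k_row, x_row in zip(kernel, window)
            for k, x in zip(k_row, x_row)
        )
        best = max(best, score)
    return best

def multi_kernel_features(embedded, kernels):
    """One pooled feature per kernel; kernels may have different widths."""
    return [conv_max_pool(embedded, kernel) for kernel in kernels]

def random_kernel(width, dim, rng):
    """Hypothetical untrained kernel of shape (width, dim)."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]

rng = random.Random(0)
dim = 4  # assumed embedding size, kept tiny for readability
document = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(10)]
kernels = [random_kernel(w, dim, rng) for w in (2, 3, 4)]
features = multi_kernel_features(document, kernels)
print(len(features))  # one feature per kernel
```

In a trained model the pooled features from all kernels would be concatenated and passed to a classification layer; each kernel width lets the network respond to word n-grams of a different length.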

Highlights

  • Classification of text documents is of high importance for many Natural Language Processing (NLP) technologies

  • This study presents a deep learning model based on a Convolutional Neural Network (CNN) and n-gram word embedding language models with sub-word information

  • The Superior Arabic Text Categorization Deep Model (SATCDM) outperforms models from similar studies, with accuracy ranging from 97.58% to 99.90%
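As an illustration of what "sub-word information" can mean, the sketch below extracts character n-grams with word-boundary markers, a scheme used by fastText-style embeddings. This is an assumption for illustration: the text above does not say which sub-word scheme the authors use, and the function name and default n-gram range are hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# The Arabic word for "book" has four letters, so with boundary markers
# it yields four 3-grams.
trigrams = char_ngrams("كتاب", 3, 3)
print(len(trigrams))
```

A word vector can then be formed by combining the vectors of its n-grams, which lets morphologically related forms share parameters, a useful property for a highly derivational language such as Arabic.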

Introduction

Classification of text documents is of high importance for many NLP technologies. Document classification is the process of categorizing documents into classes based on their contents. Classifying Arabic documents has always been a challenge because of the nature of the language itself, which has rich dialects and an enormous number of synonyms. The lack of Arabic resources compared to languages such as English, inaccurate stemming algorithms, the highly derivational nature of Arabic, and the ambiguity introduced by diacritics all make the classification task complex [1], [2]. Categorizing Arabic text documents is considered an important research topic in Arabic Natural Language Processing (ANLP) and Machine Learning (ML). Classifying Arabic documents into specific classes is of high importance to many people and applications.
