A DEEP AUTOENCODER-BASED REPRESENTATION FOR ARABIC TEXT CATEGORIZATION

Fatima-Zahra El-Alami, Said Ouatik El Alaoui, Abdelkader El Mahdaouy, Noureddine En-Nahnahi

doi:10.32890/jict2020.19.3.4

Open Access

https://doi.org/10.32890/jict2020.19.3.4

Abstract

Arabic text representation is a challenging task for several applications, such as text categorization and clustering, because the Arabic language is known for its variety, richness and complex morphology. To date, Bag-of-Words remains the most common method for Arabic text representation. However, it suffers from several shortcomings, such as a lack of semantics and high dimensionality of the feature space. Moreover, most existing methods ignore the explicit knowledge contained in semantic vocabularies such as Arabic WordNet. To overcome these shortcomings, we proposed a deep Autoencoder-based representation for Arabic text categorization. It consisted of three stages: (1) extracting the most relevant concepts from Arabic WordNet using feature selection processes; (2) learning features for text representation via an unsupervised algorithm; and (3) categorizing text using a deep Autoencoder. Our method accounted for document semantics by combining implicit and explicit semantics, and it reduced feature space dimensionality. To evaluate our method, we conducted several experiments on the standard Arabic dataset OSAC. The results showed the effectiveness of the proposed method compared to state-of-the-art methods.
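To make the idea of unsupervised feature learning with an autoencoder more concrete, the following is a minimal sketch of a single-hidden-layer autoencoder that compresses 4-dimensional document vectors into a 2-dimensional code. It is not the authors' architecture: the class name TinyAutoencoder, the toy data, the layer sizes and the learning rate are all assumptions chosen for illustration, and a deep autoencoder would stack several such layers.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyAutoencoder:
    """Hypothetical one-hidden-layer autoencoder (sigmoid encoder, linear decoder)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                      for _ in range(n_hidden)]
        self.b_enc = [0.0] * n_hidden
        self.w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                      for _ in range(n_in)]
        self.b_dec = [0.0] * n_in

    def encode(self, x):
        # Compressed representation of the input vector.
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, h):
        # Linear reconstruction of the input from the code.
        return [sum(w * hj for w, hj in zip(row, h)) + b
                for row, b in zip(self.w_dec, self.b_dec)]

    def loss(self, data):
        # Mean squared reconstruction error over the data set.
        total = 0.0
        for x in data:
            y = self.decode(self.encode(x))
            total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        return total / len(data)

    def train_step(self, x, lr):
        # One stochastic gradient step on 0.5 * ||decode(encode(x)) - x||^2.
        h = self.encode(x)
        y = self.decode(h)
        err = [yi - xi for yi, xi in zip(y, x)]
        # Backpropagate to the hidden layer before the decoder weights change.
        grad_h = [sum(err[i] * self.w_dec[i][j] for i in range(len(x)))
                  for j in range(len(h))]
        for i in range(len(x)):
            for j in range(len(h)):
                self.w_dec[i][j] -= lr * err[i] * h[j]
            self.b_dec[i] -= lr * err[i]
        for j in range(len(h)):
            delta = grad_h[j] * h[j] * (1.0 - h[j])
            for k in range(len(x)):
                self.w_enc[j][k] -= lr * delta * x[k]
            self.b_enc[j] -= lr * delta


# Toy document vectors (assumed values): two loose groups of similar documents.
DATA = [
    [1.0, 1.0, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 1.0],
    [0.1, 0.0, 1.0, 0.9],
]

model = TinyAutoencoder(n_in=4, n_hidden=2)
before = model.loss(DATA)
for _ in range(500):
    for x in DATA:
        model.train_step(x, lr=0.1)
after = model.loss(DATA)

print("reconstruction error before/after training:", before, after)
print("2-dimensional code for the first document:", model.encode(DATA[0]))
```

The hidden-layer output plays the role of the reduced document representation; in a categorization pipeline, those codes (rather than the raw high-dimensional vectors) would be passed on to the classification stage.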

Highlights

Text categorization consists of automatically assigning textual documents to their most relevant categories (Swesi & Bakar, 2019).
We propose an Arabic text categorization method based on a deep autoencoder to deal with the aforementioned shortcomings, such as the high dimensionality of the feature representation space and the lack of semantics.
Chi-square was more effective than the Variance Threshold (VT) for both Bag-of-Words and Bag-of-Concepts representations, since it selected the best features based on the probability of interdependence between the term and the category.
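The contrast between the two feature selection criteria in the last highlight can be illustrated with a small sketch. The code below computes the standard chi-square statistic from a 2x2 term/category contingency table and, for comparison, the variance of a term's binary presence across documents. The documents, labels and function names are hypothetical toy values, not data from the paper.

```python
# Toy corpus (assumed): each document is a set of tokens plus a category label.
DOCS = [
    ({"goal", "said"}, "sport"),
    ({"goal"}, "sport"),
    ({"bank", "said"}, "economy"),
    ({"bank"}, "economy"),
]


def chi_square(docs, term, category):
    """Chi-square statistic of the 2x2 table (term present?) x (in category?)."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def presence_variance(docs, term):
    """Variance of the binary presence indicator; ignores category labels."""
    p = sum(term in tokens for tokens, _ in docs) / len(docs)
    return p * (1.0 - p)


for term in ("goal", "said"):
    print(term,
          "chi2 =", chi_square(DOCS, term, "sport"),
          "variance =", presence_variance(DOCS, term))
```

In this toy corpus both terms occur in half of the documents, so a variance-based filter cannot tell them apart, whereas chi-square ranks "goal" above "said" because only "goal" depends on the category.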

Summary

Introduction

Text categorization consists of automatically assigning textual documents to their most relevant categories (Swesi & Bakar, 2019). Arabic text categorization suffers from several problems, ranging from high dimensionality of the feature representation space to a lack of semantics. To enhance Arabic text categorization, it is necessary to build an efficient text representation that reduces the feature space dimensionality and reflects text semantics. The Bag-of-Words and character-level n-gram approaches have been widely used and still achieve highly competitive results (Abu-Errub, 2014; Odeh et al., 2015). However, these representations fail to capture similarities between words and phrases, leading to feature space sparsity and the curse of dimensionality. Because words are handled as independent tokens, semantic dependencies cannot be captured.
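The limitation described above can be seen in a short sketch. Under Bag-of-Words, two documents that use different words for the same ideas share no features, so their cosine similarity is zero; once words are mapped to shared concepts, the overlap becomes visible. The dictionary TOY_CONCEPTS is a hypothetical stand-in for a lexical resource such as Arabic WordNet, and the English tokens are used only to keep the illustration readable.

```python
import math
from collections import Counter

# Hypothetical word-to-concept mapping (assumption, not taken from Arabic WordNet).
TOY_CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "fast": "speed",
    "quick": "speed",
}


def bag(tokens):
    """Count-based bag representation of a token list."""
    return Counter(tokens)


def to_concepts(tokens, concept_map):
    """Replace each word by its concept; unknown words are kept unchanged."""
    return [concept_map.get(t, t) for t in tokens]


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


doc_a = ["car", "fast"]
doc_b = ["automobile", "quick"]

# Words as independent tokens: no shared feature.
bow_sim = cosine(bag(doc_a), bag(doc_b))

# Words mapped to concepts: both documents share "vehicle" and "speed".
boc_sim = cosine(bag(to_concepts(doc_a, TOY_CONCEPTS)),
                 bag(to_concepts(doc_b, TOY_CONCEPTS)))

print("Bag-of-Words similarity:", bow_sim)
print("Bag-of-Concepts similarity:", boc_sim)
```

Mapping words to concepts also shrinks the vocabulary (four words become two concepts here), which is the dimensionality reduction side of the same idea.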

Journal: Journal of Information and Communication Technology
Publication Date: Jan 1, 2020
Citations: 8
License type: CC BY 4.0
