Abstract
Arabic text representation is a challenging task for applications such as text categorization and clustering, since the Arabic language is known for its variety, richness and complex morphology. To date, the Bag-of-Words remains the most common method for Arabic text representation. However, it suffers from several shortcomings, such as a lack of semantics and high dimensionality of the feature space. Moreover, most existing methods ignore the explicit knowledge contained in semantic vocabularies such as Arabic WordNet. To overcome these shortcomings, we proposed a deep Autoencoder-based representation for Arabic text categorization. It consisted of three stages: (1) extracting the most relevant concepts from Arabic WordNet based on feature selection processes; (2) learning features for text representation via an unsupervised algorithm; and (3) categorizing text using a deep Autoencoder. Our method accounted for document semantics by combining implicit and explicit semantics while reducing feature space dimensionality. To evaluate our method, we conducted several experiments on the standard Arabic dataset, OSAC. The obtained results showed the effectiveness of the proposed method compared to state-of-the-art ones.
Highlights
Text categorization consists of automatically assigning textual documents to their most relevant categories (Swesi & Bakar, 2019)
We propose an Arabic text categorization method based on a deep autoencoder to deal with the aforementioned shortcomings, such as the high dimensionality of the feature representation space and the lack of semantics
It can be seen that the Chi-square was more effective than the Variance Threshold (VT) for both Bag-of-Words and Bag-of-Concepts representations since it selected the best features based on the probability of interdependence between the term and category
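The contrast between the two selectors can be illustrated with a small sketch. It assumes the common 2x2 contingency formulation of Chi-square for a term and a category, which may differ in detail from the scoring used in the paper, and it assumes binary term presence as the feature value. The toy English documents, the labels, and the function names `chi_square` and `presence_variance` are hypothetical and serve only as illustration.

```python
# Toy labelled corpus (hypothetical): each document is a set of terms plus a category.
DOCS = [
    ({"the", "goal", "match", "new"}, "sport"),
    ({"the", "goal", "team"}, "sport"),
    ({"the", "market", "stock", "new"}, "economy"),
    ({"the", "market", "bank"}, "economy"),
]


def chi_square(docs, term, category):
    """Chi-square score of a term for a category from a 2x2 contingency table."""
    n = len(docs)
    a = sum(1 for words, label in docs if term in words and label == category)
    b = sum(1 for words, label in docs if term in words and label != category)
    c = sum(1 for words, label in docs if term not in words and label == category)
    d = n - a - b - c  # documents outside the category that lack the term
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # The term occurs in every document or in none, so it carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def presence_variance(docs, term):
    """Population variance of the binary presence feature; labels are ignored."""
    values = [1.0 if term in words else 0.0 for words, _ in docs]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


for term in ("goal", "new", "the"):
    print(term, chi_square(DOCS, term, "sport"), presence_variance(DOCS, term))
```

In this toy corpus, "goal" and "new" have the same variance (0.25), so a variance threshold cannot tell them apart. Chi-square gives "goal" a score of 4.0 for the sport category and "new" a score of 0.0, because only "goal" depends on the category.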
Summary
Text categorization consists of automatically assigning textual documents to their most relevant categories (Swesi & Bakar, 2019). Arabic text categorization suffers from several problems, ranging from the high dimensionality of the feature representation space to the lack of semantics. To enhance Arabic text categorization, it is necessary to build an efficient text representation that reduces the feature space dimensionality and reflects text semantics. The Bag-of-Words and character-level n-gram approaches have been widely used and still achieve highly competitive results (Abu-Errub, 2014; Odeh et al., 2015). However, these representations fail to capture similarities between words and phrases, leading to feature space sparsity and the curse of dimensionality. Because words are handled as independent tokens, semantic dependencies cannot be captured.
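The effect of treating words as independent tokens can be seen in a minimal sketch. It compares two toy documents that use different words for the same ideas, first as Bag-of-Words vectors and then after mapping each word to a concept. The `CONCEPT_OF` dictionary is a hypothetical stand-in for a lookup in a semantic resource such as Arabic WordNet. The English tokens and all function names are assumptions made for illustration, not the paper's implementation.

```python
import math
from collections import Counter

# Hypothetical word-to-concept map standing in for a semantic vocabulary lookup.
CONCEPT_OF = {
    "car": "vehicle",
    "automobile": "vehicle",
    "doctor": "physician",
    "physician": "physician",
}


def bag_of_words(tokens):
    """Count each surface token independently."""
    return Counter(tokens)


def bag_of_concepts(tokens):
    """Count concepts instead of tokens; unknown tokens are kept as they are."""
    return Counter(CONCEPT_OF.get(token, token) for token in tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = ["car", "doctor"]
doc_b = ["automobile", "physician"]

print(cosine(bag_of_words(doc_a), bag_of_words(doc_b)))
print(cosine(bag_of_concepts(doc_a), bag_of_concepts(doc_b)))
```

The two documents share no surface tokens, so their Bag-of-Words similarity is zero even though they express the same content. After concept mapping, both reduce to the same two concepts and the similarity is approximately one. The concept vector also has fewer distinct dimensions across the corpus than the word vector.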
More From: Journal of Information and Communication Technology