Abstract

Arabic document clustering is an important task for obtaining good results with search engines, information retrieval (IR) systems, and text mining applications, especially given the rapid growth in the number of online documents written in Arabic. Document clustering is the process of partitioning a collection of texts into subgroups whose members are similar in content. Clustering algorithms are mainly divided into two categories: hierarchical algorithms and partitional algorithms. In this paper, we study the most popular hierarchical approach, the agglomerative hierarchical algorithm, using seven linkage techniques with a variety of distance functions and similarity measures, such as the Euclidean distance, cosine similarity, the Jaccard coefficient, and the Pearson correlation coefficient, in order to test their effectiveness for Arabic document clustering and to recommend the best of the techniques tested. Furthermore, we study the effect of stemming the test dataset before clustering it with the same clustering technique and the same similarity/distance measures. The results show that, on the one hand, the Ward function outperformed the other linkage techniques; on the other hand, stemming does not yield good results, although it makes the document representation smaller and the clustering faster.

Keywords: Arabic Text Mining Applications; Arabic Language; Arabic Text Clustering; Hierarchical Clustering; Agglomerative Hierarchical Clustering; Similarity Measures; Stemming
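As a rough illustration of the approach described in the abstract, the following minimal Python sketch implements the four named measures and a naive agglomerative procedure. It is not the authors' implementation. Several details are assumptions made for the example: the toy term-frequency vectors in `docs` are hypothetical, similarities are turned into distances as one minus the similarity, the Jaccard coefficient is taken in its extended (Tanimoto) form for weighted vectors, and only four linkage rules (single, complete, average, Ward) are shown, since the abstract names only Ward among the seven it evaluates.

```python
import math
from itertools import combinations

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # Assumption: distance = 1 - cosine similarity
    norm = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return 1.0 - _dot(a, b) / norm if norm else 1.0

def jaccard_distance(a, b):
    # Assumption: extended (Tanimoto) Jaccard coefficient for weighted vectors
    denom = _dot(a, a) + _dot(b, b) - _dot(a, b)
    return 1.0 - _dot(a, b) / denom if denom else 0.0

def pearson_distance(a, b):
    # Assumption: distance = 1 - Pearson correlation coefficient
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da, db = [x - ma for x in a], [y - mb for y in b]
    norm = math.sqrt(_dot(da, da)) * math.sqrt(_dot(db, db))
    return 1.0 - _dot(da, db) / norm if norm else 1.0

def _centroid(vectors, members):
    dim = len(vectors[0])
    return [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]

def agglomerative(vectors, k, distance=euclidean, linkage="average"):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # start with singletons

    def cluster_distance(c1, c2):
        if linkage == "ward":
            # Ward: increase in within-cluster sum of squares.
            # Always Euclidean here; the `distance` argument is ignored.
            gap = euclidean(_centroid(vectors, c1), _centroid(vectors, c2))
            return len(c1) * len(c2) / (len(c1) + len(c2)) * gap ** 2
        pairwise = [distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average linkage

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Hypothetical term-frequency vectors for four documents over four terms
docs = [[3, 0, 1, 0], [2, 0, 2, 0], [0, 4, 0, 1], [0, 3, 0, 2]]
print(agglomerative(docs, 2, distance=cosine_distance, linkage="average"))
# prints [[0, 1], [2, 3]]
```

The naive search over all cluster pairs at every step keeps the sketch short but is slow on real corpora; it is meant only to make the roles of the distance measure and the linkage rule concrete.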
