Abstract

Clustering of Arabic documents is considered a vital aspect of obtaining optimal results from unsupervised learning. Its aim is to automatically group similar documents into a single cluster using different similarity or distance measures. However, many similarity and distance measures are available, and their effectiveness in document clustering when stemming draws on syntactic structure is still unclear. Therefore, this study evaluates the impact of five similarity/distance measures (cosine similarity, the Jaccard coefficient, Pearson's correlation coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) combined with two stemming algorithms (morphology- and syntax-based lemmatization, and morphology-based Information Science Research Institute (ISRI) stemming) on clustering an Arabic text dataset. We aim to identify the best-performing similarity and distance measures and determine which measure is most suitable for Arabic document clustering. Our experimental method, which is based on syntactic structure and morphology, outperformed other stemming methods using any of the five similarity/distance measures for Arabic document clustering. The best-performing similarity and distance measures are cosine similarity and Euclidean distance, respectively.

Keywords: Similarity/distance measures; partitional clustering; lemmatization stemming; Arabic document clustering
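
For readers unfamiliar with the five measures named above, the following is a minimal Python sketch of how each could be computed for two documents represented as term-weight vectors of equal length. It is not the study's implementation: the function names are hypothetical, the Jaccard coefficient is written in its extended (Tanimoto) form for weighted vectors, and the averaged Kullback-Leibler divergence follows one common formulation from the document clustering literature, without smoothing. The abstract does not state which variants the authors used, so all of these choices are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Extended Jaccard (Tanimoto) coefficient for weighted vectors (assumed variant)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)


def pearson_correlation(a, b):
    """Pearson's correlation coefficient; undefined if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def averaged_kl_divergence(a, b):
    """Averaged KL divergence between two vectors normalized to distributions.

    Assumed formulation: for each term, both weights are compared against
    their weighted average m, and the two KL terms are weighted by w1 and w2.
    No smoothing is applied, so terms missing from one document contribute 0.
    """
    sum_a, sum_b = sum(a), sum(b)
    p = [x / sum_a for x in a]
    q = [y / sum_b for y in b]
    total = 0.0
    for pi, qi in zip(p, q):
        if pi + qi == 0:
            continue  # term absent from both documents
        w1, w2 = pi / (pi + qi), qi / (pi + qi)
        m = w1 * pi + w2 * qi
        if pi > 0:
            total += w1 * pi * math.log(pi / m)
        if qi > 0:
            total += w2 * qi * math.log(qi / m)
    return total


# Hypothetical term-frequency vectors for two short documents.
doc_a = [2, 1, 0, 1]
doc_b = [1, 1, 1, 0]
for measure in (cosine_similarity, jaccard_coefficient, pearson_correlation,
                euclidean_distance, averaged_kl_divergence):
    print(measure.__name__, measure(doc_a, doc_b))
```

Note that the first three functions are similarities (larger means more alike), while Euclidean distance and averaged KL divergence are dissimilarities (smaller means more alike), so a clustering routine has to treat the two groups with opposite orientation.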
