Using Cluster Analysis for Author Classification of Albanian Texts: A Study on the Effectiveness of Stop Words

Denisa Kaçorri, Albina Basholli, Luela Prifti

doi:10.37394/232018.2024.12.2

Abstract

Cluster analysis is a statistical approach that identifies homogeneous groups within data. The closeness of data points is measured quantitatively using distance functions. In text data mining, clustering serves both to group words by the similarity of their occurrence within texts and to classify texts by topic or author. Hierarchical clustering is a powerful technique for identifying natural groupings within datasets, which can be especially useful for unsupervised text classification. This paper uses cluster analysis to group Albanian texts by author. Using agglomerative hierarchical clustering, we classify Albanian texts by author according to the similarity of their word frequencies. The similarity of texts is evaluated using cosine and Euclidean distances. Considering two cases, with and without Albanian stop words, we conclude that the best clustering of the Albanian documents by author, with 87% accuracy, is achieved using Ward’s method with cosine distance after removing stop words.
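To make the two distance measures named in the abstract concrete, the following minimal Python sketch builds relative word-frequency vectors for two short texts and compares them with and without stop words. It is an illustration under stated assumptions, not the authors' pipeline: the `STOP_WORDS` set is a tiny hand-picked sample of Albanian function words, the whitespace tokenization is a simplification, and the function names are hypothetical. The clustering step itself (agglomerative linkage such as Ward's method) is not shown.

```python
import math
from collections import Counter

# Assumption: a tiny illustrative set of Albanian function words,
# NOT the stop-word list used in the paper.
STOP_WORDS = {"dhe", "në", "të", "e", "i", "një", "për", "me"}


def word_frequencies(text, remove_stop_words=False):
    """Return a dict mapping each word to its relative frequency in the text."""
    words = text.lower().split()  # naive whitespace tokenization
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}


def cosine_distance(f, g):
    """1 - cosine similarity of two sparse frequency vectors."""
    dot = sum(v * g.get(w, 0.0) for w, v in f.items())
    norm_f = math.sqrt(sum(v * v for v in f.values()))
    norm_g = math.sqrt(sum(v * v for v in g.values()))
    if norm_f == 0.0 or norm_g == 0.0:
        return 1.0  # convention chosen here for empty vectors
    return 1.0 - dot / (norm_f * norm_g)


def euclidean_distance(f, g):
    """Euclidean distance over the union of both vocabularies."""
    vocab = set(f) | set(g)
    return math.sqrt(sum((f.get(w, 0.0) - g.get(w, 0.0)) ** 2 for w in vocab))


text_a = "dhe mali dhe deti"
text_b = "dhe lumi dhe fusha"

with_stop = cosine_distance(word_frequencies(text_a), word_frequencies(text_b))
without_stop = cosine_distance(
    word_frequencies(text_a, remove_stop_words=True),
    word_frequencies(text_b, remove_stop_words=True),
)
```

In this toy example the two texts share only the function word "dhe", so they look fairly similar when stop words are kept and completely dissimilar once they are removed. This is the kind of effect the two study cases compare. A pairwise distance matrix built this way could then be passed to an agglomerative clustering routine.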

Full Text