Abstract

Cluster analysis is a statistical approach for identifying homogeneous groups within data. The closeness of data points is measured quantitatively using distance functions. In text data mining, clustering is used to categorize words by the similarity of their occurrence within texts and to classify texts by topic or author. Hierarchical clustering is a powerful technique for identifying natural groupings within datasets, which makes it especially useful for unsupervised text classification. This paper uses cluster analysis to group Albanian texts by author. Using agglomerative hierarchical clustering, we classify Albanian texts by author according to the similarity of their word frequencies. The similarity of texts is evaluated using cosine and Euclidean distances. Considering two cases, with and without Albanian stop words, we conclude that the best clustering of the Albanian documents by author, with 87% accuracy, is achieved using Ward’s method with cosine distance after removing stop words.
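
The pipeline described in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It builds word-frequency vectors, computes cosine and Euclidean distances, and merges texts with Ward's criterion. Several things here are assumptions: the function names are hypothetical, the four texts and the stop-word list are invented English placeholders rather than the Albanian corpus, and tokenization is plain whitespace splitting. Ward's criterion is defined on Euclidean distances, so the sketch assumes one common way of combining it with cosine distance: each vector is scaled to unit length, after which the squared Euclidean distance between two vectors equals twice their cosine distance.

```python
import math
from collections import Counter


def word_frequencies(text, stop_words=frozenset()):
    """Relative word frequencies of a text, optionally dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def to_vectors(freqs):
    """Align per-text frequency dicts on a shared, sorted vocabulary."""
    vocab = sorted(set().union(*freqs))
    return [[f.get(w, 0.0) for w in vocab] for f in freqs]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def l2_normalize(u):
    """Scale to unit length (assumption: this is how cosine meets Ward here)."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u] if norm else list(u)


def ward_clusters(vectors, n_clusters):
    """Agglomerative clustering: repeatedly merge the pair of clusters whose
    union gives the smallest increase in within-cluster sum of squares."""
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]  # (members, centroid)
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (mi, ci), (mj, cj) = clusters[i], clusters[j]
                ni, nj = len(mi), len(mj)
                cost = ni * nj / (ni + nj) * euclidean_distance(ci, cj) ** 2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        ni, nj = len(mi), len(mj)
        centroid = [(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((mi + mj, centroid))
    return [sorted(members) for members, _ in clusters]


# Invented placeholder texts: indices 0-1 and 2-3 stand in for two authors.
texts = [
    "the sea and the wind and the sea",
    "the wind and the sea and the waves",
    "a city of stone and a city of glass",
    "a street of stone and a city of light",
]
stop_words = {"the", "and", "a", "of"}  # hypothetical stop-word list

freqs = [word_frequencies(t, stop_words) for t in texts]
vectors = [l2_normalize(v) for v in to_vectors(freqs)]
clusters = ward_clusters(vectors, n_clusters=2)
print(sorted(clusters))
```

On this toy input the two sea texts and the two city texts land in separate clusters. Passing an empty stop-word set to word_frequencies reproduces the "with stop words" case, so the two study cases can be compared by changing a single argument.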
