Semantic Document Clustering Using a Similarity Graph

Lubomir Stanchev

doi:10.1109/icsc.2016.8

Abstract

Document clustering addresses the problem of identifying groups of similar documents without human supervision. Unlike most existing solutions that perform document clustering based on keywords matching, we propose an algorithm that considers the meaning of the terms in the documents. For example, a document that contains the words dog and cat multiple times may be placed in the same category as a document that contains the word pet even if the two documents share only noise words in common. Our semantic clustering algorithm is based on a similarity graph that stores the degree of semantic relationship between terms (extracted from WordNet), where a term can be a word or a phrase. We experimentally validate our algorithm on the Reuters-21578 benchmark, which contains 11,362 newswire stories that are grouped in 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using a similarity metric that is based on keywords matching and one that uses the similarity graph. We show that the second approach produces higher precision and recall, which means that this approach matches closer the results of the human study.

Full Text