Comprehensible and accurate cluster labels in text clustering

Jerzy Stefanowski, Dawid Weiss

doi:10.5555/1931390.1931410

Abstract

The purpose of text clustering in information retrieval is to discover groups of semantically related documents. Accurate and comprehensible cluster descriptions (labels) let the user grasp the collection's content faster and are essential for various document browsing interfaces. The task of creating descriptive, sensible cluster labels is difficult: typical text clustering algorithms focus on optimizing proximity between documents inside a cluster and rely on keyword representation for describing discovered clusters. In the approach called Description Comes First (DCF), cluster labels are as important as document groups: DCF promotes machine discovery of comprehensible candidate cluster labels, which are later used to discover related document groups. In this paper we describe an application of DCF to the k-Means algorithm, including results of experiments performed on the 20-newsgroups document collection. Experimental evaluation showed that DCF does not decrease the metrics used to assess the quality of document assignment and offers good cluster labels in return. The algorithm uses a search engine's data structures directly to scale to large document collections.

Full Text