Using Sequences of Words for Non-Disjoint Grouping of Documents

Chiheb-Eddine Ben N’cir, Nadia Essoussi

doi:10.1142/s0218001415500135

Abstract

Grouping documents based on their textual content is an important application of clustering, referred to as text clustering. This paper deals with two issues in text clustering: the detection of non-disjoint groups and the representation of textual data. A text document can discuss several topics and should therefore be able to belong to several groups, so the learning algorithm must be able to produce non-disjoint clusters and assign documents to more than one cluster. Because text documents are unstructured data, applying a learning algorithm requires first preparing the document set for numerical analysis, usually with the vector space model (VSM). This representation ignores correlations between terms and gives no importance to the order of words in the text. We therefore present an unsupervised learning method, based on the word sequence kernel, that takes into account both the correlation between adjacent words in the text and the possibility that a document belongs to more than one cluster. In addition, to facilitate the use of this method in text-analytic practice, we present the "DocCO" software, which is publicly available. Experiments performed on several text collections show that the proposed method outperforms existing overlapping methods that use the VSM representation in terms of clustering accuracy.
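The limitation of the VSM noted in the abstract can be illustrated with a small sketch. It is not the authors' method or the DocCO implementation: the word sequence kernel is approximated here by a simple cosine similarity over contiguous word bigrams, and the function names (`bag_of_words`, `word_ngrams`, `cosine`) and the two example sentences are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """VSM-style representation: term counts, word order discarded."""
    return Counter(text.lower().split())


def word_ngrams(text, n=2):
    """Counts of contiguous word n-grams; a simplified stand-in for
    sequence-based features (assumption: no gaps, no decay weighting)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Two hypothetical documents with the same words in a different order.
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

# Identical term counts, so the bag-of-words view cannot tell them apart.
vsm_similarity = cosine(bag_of_words(doc_a), bag_of_words(doc_b))

# Only three of the four bigrams are shared, so the similarity drops below 1.
sequence_similarity = cosine(word_ngrams(doc_a), word_ngrams(doc_b))
```

With these inputs the bag-of-words similarity is 1 (up to floating-point rounding), while the bigram-based similarity is 0.75, which shows in miniature how a sequence-aware representation retains information that the VSM drops.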

Full Text