Document Clustering Using Semantic Cliques Aggregation

Ajit Kumar, I-Jen Chiang

doi:10.4236/jcc.2015.312004

Abstract

Search engines are indispensable tools for finding information among massive collections of web pages and documents. A good search engine needs to retrieve information that is not only delivered quickly but also relevant to the users' queries. Most search engines respond to queries quickly; however, they provide little guarantee of precision, even for highly detailed queries. In such cases, clustering documents by subject and content might improve search results. This paper presents a novel method of document clustering that uses semantic cliques. First, we extracted features from the documents. Next, we defined the associations between frequently co-occurring terms, which we call semantic cliques. Each connected component in the semantic cliques represents a theme. The documents were clustered by theme, for which we designed an aggregation algorithm. We evaluated the effectiveness of the aggregation algorithm using four kinds of datasets. The results showed that the semantic-clique-based document clustering algorithm performed significantly better than traditional clustering algorithms such as Principal Direction Divisive Partitioning (PDDP), k-means, Auto-Class, and Hierarchical Agglomerative Clustering (HAC). We found that Semantic Clique Aggregation is a promising model for representing association rules in text and could be immensely useful for automatic document clustering.

Highlights

The explosion of diverse information over the Internet created a need for automated tools to help web users find relevant information.
Search engines are indispensable tools to find, filter, and extract the desired information embedded in massive web pages and documents over the Internet [3].
We proposed a straightforward use of association rules in document clustering, which relies only on the concept of support.

Summary

Introduction

The explosion of diverse information over the Internet created a need for automated tools to help web users find relevant information [1]-[3]. Search engines are indispensable tools to find, filter, and extract the desired information embedded in massive web pages and documents over the Internet [3]. However, search engines often return inconsistent, irrelevant, and messy results [4]. Polysemy, phrases, and term dependency bring additional challenges for search-related technologies [5]. A single term is usually not enough to identify the theme (also known as the concept) of a document. For example, the term "mouse" can be associated with a computer, an animal, or a person, and each association denotes a different theme.
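The "mouse" example can be illustrated with a small sketch. It is not the aggregation algorithm of this paper; it only shows, under simplifying assumptions, how co-occurring term pairs filtered by support can separate documents that share an ambiguous term. The function names `frequent_pairs` and `group_by_pair`, the toy documents, and the threshold `min_support=2` are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def frequent_pairs(docs, min_support):
    """Count, for each term pair, the number of documents containing both terms.

    Only pairs found in at least `min_support` documents are returned.
    (Illustrative helper; support here is a raw document count.)
    """
    counts = defaultdict(int)
    for terms in docs:
        # Sort the distinct terms so each pair has one canonical order.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def group_by_pair(docs, pairs):
    """Map each frequent pair to the indices of the documents containing it."""
    groups = defaultdict(list)
    for doc_id, terms in enumerate(docs):
        present = set(terms)
        for a, b in pairs:
            if a in present and b in present:
                groups[(a, b)].append(doc_id)
    return dict(groups)


# Toy corpus (assumed data): every document mentions "mouse".
docs = [
    ["mouse", "keyboard", "computer"],
    ["mouse", "computer", "screen"],
    ["mouse", "cat", "cheese"],
    ["mouse", "cat", "trap"],
]

pairs = frequent_pairs(docs, min_support=2)
groups = group_by_pair(docs, pairs)
for pair, doc_ids in sorted(groups.items()):
    print(pair, doc_ids)
```

In this toy corpus the single term "mouse" appears in every document and separates nothing, while the pairs ("computer", "mouse") and ("cat", "mouse") split the documents into two groups that correspond to different themes.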

Objectives

Methods

Conclusion