Correlated concept based dynamic document clustering algorithms for newsgroups and scientific literature

Jayaraj Jayabharathy, Selvadurai Kanmani

doi:10.1186/2193-8636-1-3

Abstract

The growing number of documents in corpora such as newsgroups, government organizations, the internet, and digital libraries has made categorizing and retrieving them increasingly complex. Incorporating semantic features into clustering improves the accuracy of document retrieval and makes it possible to organize and retrieve documents more efficiently from large corpora. Although semantics-based clustering enhances the quality of clusters, scalability remains a challenge. In this paper, three dynamic document clustering algorithms are proposed: Term frequency based MAximum Resemblance Document Clustering (TMARDC), Correlated Concept based MAximum Resemblance Document Clustering (CCMARDC), and Correlated Concept based Fast Incremental Clustering Algorithm (CCFICA). TMARDC is based on term frequency, whereas CCMARDC and CCFICA are based on a correlated-terms (terms and their related terms) concept extraction algorithm. The proposed algorithms were compared with existing static and dynamic document clustering algorithms through experimental analysis on datasets chosen from 20Newsgroups and scientific literature. F-measure and Purity were used as the metrics for evaluating the performance of the algorithms. The experimental results demonstrate that the proposed algorithms exhibit better performance than the four existing document clustering algorithms.

Highlights

Tremendous growth in the volume of text documents available from various sources like the Internet, digital libraries, news sources, and company-wide intranets has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize information, with an ultimate goal of helping the users to find what they are looking for.
Experimental results, data set: The data set used for the experimental analysis contains 500 article abstracts collected from the Science Direct digital library.
Performance metrics: F-measure and Purity are the performance measures used to evaluate the quality of document clustering.
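
The two metrics named above can be illustrated with a short sketch. It uses the commonly cited definitions (purity as the fraction of documents that fall in the majority class of their cluster, and the class-weighted best-match F-measure); this excerpt does not show the paper's own formulas, so treat these definitions as an assumption. The function names `purity` and `f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def _pair_counts(labels_true, labels_pred):
    """Count documents for each (cluster, class) pair."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labels_true and labels_pred must have equal length")
    if not labels_true:
        raise ValueError("at least one document is required")
    return Counter(zip(labels_pred, labels_true))


def purity(labels_true, labels_pred):
    """Fraction of documents belonging to the majority class of their cluster."""
    majority = {}
    for (cluster, _cls), n in _pair_counts(labels_true, labels_pred).items():
        majority[cluster] = max(majority.get(cluster, 0), n)
    return sum(majority.values()) / len(labels_true)


def f_measure(labels_true, labels_pred):
    """Class-size-weighted average of each class's best F score over all clusters."""
    counts = _pair_counts(labels_true, labels_pred)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    best_f = {cls: 0.0 for cls in class_sizes}
    for (cluster, cls), n in counts.items():
        precision = n / cluster_sizes[cluster]  # share of the cluster in this class
        recall = n / class_sizes[cls]           # share of the class in this cluster
        f_score = 2 * precision * recall / (precision + recall)
        best_f[cls] = max(best_f[cls], f_score)
    total = len(labels_true)
    return sum(class_sizes[cls] / total * best_f[cls] for cls in class_sizes)


# Toy example (made-up labels, not data from the paper).
true_classes = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
print(round(purity(true_classes, clusters), 4))
print(round(f_measure(true_classes, clusters), 4))
```

Both scores lie between 0 and 1, and a clustering that reproduces the reference classes exactly scores 1.0 on each. Purity alone rewards many small clusters, which is one reason it is usually reported alongside F-measure.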

Summary

Introduction

Tremendous growth in the volume of text documents available from various sources like the Internet, digital libraries, news sources, and company-wide intranets has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize information, with an ultimate goal of helping the users to find what they are looking for. In this context, fast and high-quality document clustering algorithms play an important role. They have been shown to provide an intuitive navigation and browsing mechanism by organizing large amounts of information into a small number of meaningful clusters, and to greatly improve retrieval performance through cluster-driven dimensionality reduction, term weighting (Tang et al. 2005), or query expansion (Sammut and Webb 2010). A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information.

Journal: Decision Analytics	Publication Date: Feb 19, 2014
Citations: 17	License type: cc-by
