Abstract
Increase in the number of documents in the corpuses like News groups, government organizations, internet and digital libraries, have led to greater complexity in categorizing and retrieving them. Incorporating semantic features will improve the accuracy of retrieving documents through the method of clustering and which will also pave the way to organize and retrieve the documents more efficiently, from the large available corpuses. Even though clustering based on semantics enhances the quality of clusters, scalability of the system still remains complicated. In this paper, three dynamic document clustering algorithms, namely: Term frequency based MAximum Resemblance Document Clustering (TMARDC), Correlated Concept based MAximum Resemblance Document Clustering (CCMARDC) and Correlated Concept based Fast Incremental Clustering Algorithm (CCFICA) are proposed. From the above three proposed algorithms the TMARDC algorithm is based on term frequency, whereas, the CCMARDC and CCFICA are based on Correlated terms (Terms and their Related terms) concept extraction algorithm. The proposed algorithms were compared with the existing static and dynamic document clustering algorithms by conducting experimental analysis on the dataset chosen from 20Newsgroups and scientific literature. F-measure and Purity have been considered as metrics for evaluating the performance of the algorithms. The experimental results demonstrate that the proposed algorithm exhibit better performance, compared to the four existing algorithms for document clustering.
Highlights
Tremendous growth in the volume of text documents available from various sources like the Internet, digital libraries, news sources, and company-wide intranets has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize information, with an ultimate goal of helping the users to find what they are looking for
Experimental results Data set The data set used for the experimental analysis contains 500 abstract articles collected from the Science Direct digital library
Performance metrics F-measure and Purity are the performance measures used to evaluate the quality of document clustering
Summary
Tremendous growth in the volume of text documents available from various sources like the Internet, digital libraries, news sources, and company-wide intranets has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize information, with an ultimate goal of helping the users to find what they are looking for. In this context, fast and high-quality document clustering algorithms play an important role, as they have shown to provide both an intuitive navigation/browsing mechanism, by organizing large amounts of information into a small number of meaningful clusters, as well as to greatly improve the retrieval performance either by cluster-driven dimensionality reduction, term-weighting Tang et al (2005), or by query expansion Sammut and Webb (2010). A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.