Abstract

The limitations of standard clustering methods and the weaknesses of the Apriori algorithm in frequent termset clustering motivate our research. Based on association rule mining, we have devised an efficient approach for Web document clustering (ARWDC). An efficient Multi-Tire Hashing Frequent Termsets algorithm (MTHFT) is used to improve the efficiency of mining association rules by speeding up the mining of frequent termsets. The documents are then initially partitioned based on the association rules. Since a document usually contains more than one frequent termset, the same document may appear in multiple initial partitions, i.e., the initial partitions overlap. After the partitions are made disjoint, the documents within each partition are grouped using descriptive keywords to obtain the resulting clusters. In this paper, we present an extensive analysis of the ARWDC approach for different sizes of Reuters datasets. Furthermore, the performance of our approach is evaluated using measures such as Precision, Recall and F-measure, and compared with existing clustering algorithms such as Bisecting K-means and FIHC. The experimental results show that the ARWDC approach significantly improves efficiency, scalability and accuracy on the Reuters datasets.

The internet has become the largest data repository and faces the problem of information overload. The abundance of information, combined with the dynamic and heterogeneous nature of the Web, makes information retrieval a tedious process for the average user. Search engines, meta-search engines and Web directories have been developed to help users satisfy their information needs quickly and easily. A search engine performs exact matching between the query terms and the keywords that characterize each web page and presents the results to the user.
These results are long lists of URLs, which are very hard to search. Furthermore, users without domain expertise are not familiar with the appropriate terminology and so do not submit the right query terms, which leads to the retrieval of more irrelevant pages. This has created a need for new techniques that help users effectively navigate, trace and organize the available web documents, with the ultimate goal of finding those that best match their needs. Document clustering is one technique that can play an important role in achieving this objective. It has become an increasingly important task in analyzing huge numbers of documents distributed among various sites and organizing them into groups, called clusters, in which the documents share common properties according to a defined similarity measure. Fast, high-quality document clustering algorithms play an important role in helping users effectively navigate, summarize and organize information. Document clustering has been studied intensively because of its wide applicability in areas such as Web mining, search engines, information retrieval and topological analysis. Document clustering differs from document classification. In document classification, the classes (and their properties) are known a priori, and documents are assigned to these classes, whereas in document clustering the number, properties and membership (composition) of the classes are not known in advance. Thus, classification is an example of supervised machine learning and clustering that of unsupervised machine learning.
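
The partitioning step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy documents, the pre-mined termsets, and the function names `initial_partitions` and `make_disjoint` are all hypothetical, and the rule used here to make partitions disjoint (keep each document in the partition with the largest matching termset) is an assumption, since the text does not state the rule.

```python
from collections import defaultdict

# Toy corpus: each document is reduced to its set of terms (hypothetical data).
docs = {
    "d1": {"apple", "fruit", "juice"},
    "d2": {"apple", "computer", "windows"},
    "d3": {"windows", "computer", "software"},
    "d4": {"fruit", "juice", "orange"},
}

# Frequent termsets assumed to have been mined already (hypothetical output).
frequent_termsets = [
    frozenset({"apple"}),
    frozenset({"fruit", "juice"}),
    frozenset({"computer", "windows"}),
]


def initial_partitions(docs, termsets):
    """Put a document into every partition whose termset it fully contains."""
    partitions = defaultdict(set)
    for doc_id, terms in docs.items():
        for ts in termsets:
            if ts <= terms:
                partitions[ts].add(doc_id)
    return dict(partitions)


def make_disjoint(partitions):
    """Keep each document only in the partition with the largest termset.

    Assumption: ties are broken by the alphabetically smallest sorted termset;
    the paper may use a different scoring rule.
    """
    best = {}
    for ts, members in partitions.items():
        key = (-len(ts), sorted(ts))
        for doc_id in members:
            if doc_id not in best or key < best[doc_id][0]:
                best[doc_id] = (key, ts)
    disjoint = defaultdict(set)
    for doc_id, (_, ts) in best.items():
        disjoint[ts].add(doc_id)
    return dict(disjoint)


overlapping = initial_partitions(docs, frequent_termsets)
disjoint = make_disjoint(overlapping)
```

In this toy run, `d1` and `d2` each appear in two initial partitions; after `make_disjoint`, every document belongs to exactly one partition, and the single-term partition for "apple" disappears because both of its documents match a larger termset.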

Highlights

  • The internet has become the largest data repository, facing the problem of information overload

  • We evaluate the performance of the Association Rules-based Web Document Clustering (ARWDC) approach in terms of efficiency, accuracy and scalability, compared with the Bisecting K-means and FIHC algorithms

  • We have conducted an extensive analysis of the association rules-based Web document clustering (ARWDC) approach
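
For readers unfamiliar with the accuracy measures named in the abstract, the sketch below computes Precision, Recall and F-measure for one cluster against one reference class, using the standard set-based definitions. The function name `precision_recall_f` and the document ids are hypothetical, and the paper may aggregate these scores across classes in a way not shown here.

```python
def precision_recall_f(cluster, reference_class):
    """Score one cluster against one reference class (both are sets of document ids)."""
    overlap = len(cluster & reference_class)
    # Precision: share of the cluster that belongs to the reference class.
    precision = overlap / len(cluster) if cluster else 0.0
    # Recall: share of the reference class that the cluster recovered.
    recall = overlap / len(reference_class) if reference_class else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: a cluster of four documents, two of which
# belong to a reference class of three documents.
cluster = {"d1", "d2", "d3", "d4"}
reference = {"d1", "d2", "d5"}
p, r, f = precision_recall_f(cluster, reference)
```

Here two of the four clustered documents are in the reference class, so precision is 2/4, recall is 2/3, and the F-measure is their harmonic mean, 4/7.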

Summary

INTRODUCTION

The internet has become the largest data repository and faces the problem of information overload. Users without domain expertise are not familiar with the appropriate terminology and so do not submit the right query terms, which leads to the retrieval of more irrelevant pages. This has created a need for new techniques that help users effectively navigate, trace and organize the available web documents, with the ultimate goal of finding those that best match their needs. In our prior research [27], we presented an efficient Association Rules-based Web Document Clustering approach (ARWDC). If we find association rules indicating that both words occur together in many documents, we may identify another topic, one that discusses operating systems or computers. By precisely identifying these hidden topics as the first step and clustering documents based on them, we can improve the accuracy of the clustering solution.
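
As a rough illustration of how an association rule between two co-occurring words can be quantified, the sketch below computes the standard support and confidence measures over a toy collection. The documents, the word pair, and the function names `support` and `confidence` are hypothetical; this is the textbook definition of the two measures, not the MTHFT algorithm.

```python
def support(docs, termset):
    """Fraction of documents that contain every term in termset."""
    if not docs:
        return 0.0
    termset = set(termset)
    return sum(1 for terms in docs if termset <= terms) / len(docs)


def confidence(docs, antecedent, consequent):
    """Estimate how often the consequent appears in documents containing the antecedent."""
    base = support(docs, antecedent)
    if base == 0:
        return 0.0
    return support(docs, set(antecedent) | set(consequent)) / base


# Toy documents as term sets (hypothetical data).
docs = [
    {"windows", "apple", "computer"},
    {"windows", "apple", "software"},
    {"apple", "fruit"},
    {"windows", "house"},
]

# Rule under test: documents containing "windows" also contain "apple".
rule_support = support(docs, {"windows", "apple"})
rule_confidence = confidence(docs, {"windows"}, {"apple"})
```

In this toy collection the pair occurs in two of four documents (support 0.5), and two of the three documents containing "windows" also contain "apple" (confidence 2/3). A rule that clears chosen minimum thresholds would then be a candidate signal for a shared topic.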

REVIEW OF LITERATURE
ASSOCIATION RULES BASED CLUSTERING APPROACH
Offline Collecting of Documents stage
Document Preprocessing stage
Documents Clustering Stage
Post processing
EXPERIMENTAL RESULTS AND PERFORMANCE EVALUATION
Evaluation Methods
Experimental Results
CONCLUSION
FUTURE WORK