Abstract

Duplicate and near-duplicate web pages are a chief concern for web search engines. They consume enormous space in the index, which ultimately slows down the serving of results and raises its cost. A variety of techniques have been developed to identify pairs of web pages that are “similar” to each other, and the problem of finding near-duplicate web pages has been a subject of research in the database and web-search communities for some years. To identify near-duplicate web pages, we use sentence-level features together with a fingerprinting method. When a large number of web documents must be examined, we first apply K-mode clustering and then compare sentence features and fingerprints. Using these steps, we identify near-duplicate web pages accurately and efficiently. Experiments were carried out on web page collections, and the results confirmed the efficiency of the proposed approach in dete...
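
The sentence-level fingerprint comparison described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the sentence features or the fingerprint function, so the sketch assumes each normalized sentence is hashed with SHA-1 and two pages are compared by the Jaccard overlap of their hash sets. The function names and the 0.8 threshold are hypothetical, and the K-mode clustering step is omitted.

```python
import hashlib
import re


def sentence_fingerprints(text):
    """Return a set of short hashes, one per normalized sentence.

    Assumption: a sentence ends at '.', '!' or '?', and normalization
    means lowercasing and keeping only alphanumeric tokens.
    """
    fingerprints = set()
    for sentence in re.split(r"[.!?]+", text):
        normalized = " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))
        if normalized:
            digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            fingerprints.add(digest[:16])  # truncated hash as fingerprint
    return fingerprints


def similarity(page_a, page_b):
    """Jaccard similarity between the sentence fingerprints of two pages."""
    fa = sentence_fingerprints(page_a)
    fb = sentence_fingerprints(page_b)
    if not fa and not fb:
        return 1.0  # two empty pages are treated as identical
    return len(fa & fb) / len(fa | fb)


def is_near_duplicate(page_a, page_b, threshold=0.8):
    """Flag a pair as near-duplicate when similarity reaches the threshold.

    The threshold value is an illustrative assumption, not from the paper.
    """
    return similarity(page_a, page_b) >= threshold
```

In a large collection, a clustering pass such as the K-mode step mentioned in the abstract would presumably restrict this pairwise comparison to pages within the same cluster, avoiding a comparison of every page against every other.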

Highlights

  • Web mining is the branch of data mining that deals with the study of the World Wide Web [9]

  • Such search engines depend on huge collections of web pages obtained by web crawlers, which traverse the web by following hyperlinks and store the downloaded pages in a large database that is later indexed for efficient execution of user queries [17]

  • Near-duplicate web pages pose a serious threat to the web crawling community and have become the prime concern for the web search engines


Summary

Introduction

Web mining is the branch of data mining that deals with the study of the World Wide Web [9]. Search engines use web crawling to populate a local indexed repository of web pages, which is in turn used to answer user search queries [18]. Such search engines depend on huge collections of web pages obtained by web crawlers, which traverse the web by following hyperlinks and store the downloaded pages in a large database that is later indexed for efficient execution of user queries [17].

