Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting

J Prasanna Kumar, P Govindarajulu

doi:10.1080/18756891.2013.752657

Abstract

Duplicate and near-duplicate web pages are a chief concern for web search engines. They consume enormous space in the stored indexes, which ultimately slows down the serving of results and increases its cost. A variety of techniques have been developed to identify pairs of web pages that are “similar” to each other, and the problem of finding near-duplicate web pages has been a subject of research in the database and web-search communities for some years. To identify near-duplicate web pages, we use sentence-level features together with a fingerprinting method. When a large number of web documents must be examined, we first apply K-mode clustering and then compare sentence features and fingerprints. Using these steps, we identify near-duplicate web pages accurately and efficiently. Experiments are carried out on web page collections, and the results confirm the efficiency of the proposed approach in dete...
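
The sentence-feature and fingerprint comparison mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: this excerpt does not specify the paper's actual sentence features, hash function, or decision rule, so the naive sentence splitter, the truncated SHA-1 digest, the Jaccard overlap, and the `threshold` value are all assumptions chosen for illustration, and the K-mode clustering stage is omitted.

```python
import hashlib
import re


def sentence_fingerprints(text):
    """Return a set holding one short hash per normalized sentence in `text`."""
    # Assumption: a naive split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    fingerprints = set()
    for sentence in sentences:
        # Normalize: lowercase, keep only letters and digits, single spaces.
        normalized = " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))
        if normalized:
            digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            # Assumption: a 16-hex-character prefix serves as the fingerprint.
            fingerprints.add(digest[:16])
    return fingerprints


def fingerprint_similarity(text_a, text_b):
    """Jaccard overlap of the two pages' sentence-fingerprint sets."""
    a = sentence_fingerprints(text_a)
    b = sentence_fingerprints(text_b)
    if not a and not b:
        return 1.0  # two empty pages are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Flag a pair as near-duplicate when the overlap reaches `threshold`."""
    # Assumption: 0.8 is an illustrative cutoff, not a value from the paper.
    return fingerprint_similarity(text_a, text_b) >= threshold
```

In a pipeline like the one the abstract describes, a comparison of this kind would run only on pages that the clustering step has already placed in the same cluster, which keeps the number of pairwise comparisons small.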

Highlights

Web mining is the branch of data mining that deals with the study of the World Wide Web [9]
Search engines depend on huge collections of web pages obtained by web crawlers, which traverse the web by following hyperlinks and store the downloaded pages in a large database that is later indexed for efficient execution of user queries [17]
Near-duplicate web pages pose a serious problem for the web crawling community and have become a prime concern for web search engines

Summary

Introduction

Web mining is the branch of data mining that deals with the study of the World Wide Web [9]. Search engines use web crawling to populate a local indexed repository of web pages, which is in turn used to answer user search queries [18]. Such search engines depend on huge collections of web pages obtained by web crawlers, which traverse the web by following hyperlinks and store the downloaded pages in a large database that is later indexed for efficient execution of user queries [17].

Journal: International Journal of Computational Intelligence Systems
Publication Date: Jan 1, 2013
Citations: 6
License type: cc-by
