Abstract

A Uniform Resource Locator (URL) is the address of a web page on the World Wide Web (WWW), and each URL grants access to a single web page. This paper focuses on distinct URLs that address the same web page or the same content: a web page can often be reached through two or more URLs. Such duplicate URLs are a serious problem for the entire search engine pipeline, including crawling and indexing. I present a novel algorithm for detecting canonicalization rules that normalize duplicate URLs to a single original URL. The algorithm uses a pattern recognition approach to analyze textual data. Search engines can use the resulting information about duplicate URLs to reduce cost and improve quality.
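
To make the idea of duplicate URLs concrete, the following is a minimal sketch of applying fixed, hand-written normalization rules so that two URLs for the same page collapse to one canonical form. It is not the detection algorithm presented in the paper, which discovers such rules rather than hard-coding them. The function name `normalize_url`, the `IGNORED_PARAMS` set, and the example URLs are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule set: query parameters assumed not to change page content.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "ref"}

# Default ports that can be dropped without changing the target.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Apply a few illustrative canonicalization rules to a URL."""
    parts = urlsplit(url)

    # Rule 1: scheme and host are case-insensitive, so lowercase them.
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Rule 2: drop the port when it is the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"

    # Rule 3: treat a trailing slash as insignificant (an assumption;
    # some sites serve different content with and without it).
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"

    # Rule 4: remove ignored parameters and sort the rest.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    query = urlencode(sorted(pairs))

    # Rule 5: drop the fragment, which is not sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


if __name__ == "__main__":
    first = normalize_url("HTTP://Example.com:80/news/?id=7&sessionid=abc")
    second = normalize_url("http://example.com/news?sessionid=xyz&id=7#top")
    print(first)
    print(first == second)
```

Under these assumed rules both example URLs normalize to the same string, so a crawler could fetch and index the page once instead of twice.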
