Canonization rules for detecting different URLs

Chanchal Kumari, Divya Joshi, Shailendra Narayan Singh

doi:10.1109/confluence.2016.7508093

Abstract

A Uniform Resource Locator (URL) is the address of a web page on the World Wide Web (WWW), and each URL grants access to a single page. This paper focuses on distinct URLs that address the same web page or the same content: a web page can often be reached through two or more URLs. Such duplicate URLs are a serious problem for the entire search engine pipeline, from crawling to indexing. We present a novel algorithm for detecting canonization rules that normalize duplicate URLs to a single original URL. The algorithm uses a pattern recognition approach to analyze textual data. Search engines can use the resulting information about duplicate URLs to reduce cost and improve quality.
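To make the idea of duplicate URLs concrete, the following minimal sketch shows how a few hand-written normalization rules can map different URLs for the same page onto one canonical form. It is not the algorithm of this paper: the rule set, the function name `canonicalize`, the ignored parameters, and the example URLs are all assumptions chosen for illustration, whereas the paper aims to detect such rules automatically.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule set: query parameters assumed not to change page content.
IGNORED_PARAMS = {"sessionid", "utm_source"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Apply a few illustrative normalization rules to a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme's default.
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"

    # Treat "/news/" and "/news" as the same path; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"

    # Drop ignored parameters and sort the rest so ordering does not matter.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    query = urlencode(sorted(kept))

    # The fragment is discarded because it is not sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


if __name__ == "__main__":
    a = canonicalize("HTTP://Example.com:80/news/?id=7&sessionid=abc#top")
    b = canonicalize("http://example.com/news?sessionid=xyz&id=7")
    print(a)
    print(a == b)
```

Under these assumed rules both example URLs collapse to the same string, so a crawler would need to fetch and index the page only once. Fixed rules like these are site-specific guesses, which is why learning them from observed URL patterns is attractive.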

Full Text