Aiding web crawlers; projecting web page last modification

Adeel Anjum, Adnan Anjum

doi:10.1109/inmic.2012.6511443

Abstract

Because of the colossal amount of data on the Web, Web archivists typically rely on Web crawlers for automated collection. The Internet Archive is the largest organization that uses a crawling approach to maintain an archive of the entire Web. The most important requirement of a Web crawler, especially when it is used for Web archiving, is to be aware of the date (and time) of the last modification of a Web page. This awareness has several advantages, the most important of which are (i) presenting an up-to-date version of a Web page to the end user and (ii) making it easier to adjust the crawl rate, which allows future retrieval of a Web page's version at a given date or computation of its refresh rate. The typical way to obtain this modification information, namely the Last-Modified: HTTP header, unfortunately does not always provide correct information. In this work, we discuss, with the help of experiments, various techniques that can be used to determine the date of last modification of a Web page. This helps in adjusting the crawl rate for a specific page and in presenting users with up-to-date information, thus making future versioning of a Web page more meticulous.
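To illustrate why the Last-Modified: header cannot be trusted blindly, the following is a minimal sketch of a plausibility check a crawler might run on response headers. It is not the method evaluated in the paper. The function names, the label strings, the two-second tolerance, and the heuristic that a Last-Modified value equal to the response's Date header suggests a dynamically generated timestamp are all assumptions made for this example. The headers are assumed to arrive as a plain mapping with the exact key spellings "Last-Modified" and "Date".

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional, Tuple


def _parse_http_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an HTTP date string; return None if it is absent or malformed."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Treat a date without zone information as UTC (assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def assess_last_modified(
    headers: Mapping[str, str],
    tolerance: timedelta = timedelta(seconds=2),  # hypothetical threshold
) -> Tuple[Optional[datetime], str]:
    """Return the parsed Last-Modified value and a plausibility label."""
    last_modified = _parse_http_date(headers.get("Last-Modified"))
    if last_modified is None:
        return None, "missing-or-unparseable"

    served_at = _parse_http_date(headers.get("Date"))
    if served_at is not None:
        if last_modified > served_at + tolerance:
            # A page cannot have been modified after it was served.
            return last_modified, "in-the-future"
        if abs(served_at - last_modified) <= tolerance:
            # Often a sign that the server stamps every response with "now".
            return last_modified, "equals-response-time"
    return last_modified, "plausible"
```

A crawler could treat any label other than "plausible" as a signal to fall back on other evidence of modification, such as the techniques the paper discusses.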
