Design and implementation of crawling algorithm to collect deep web information for web archiving

Hyo-Jung Oh, Dong-Hyun Won, Yong Kim, Chonghyuck Kim, Sung-Hee Park

doi:10.1108/dta-07-2017-0053

Abstract

Purpose: This paper describes the development of an algorithm for web crawlers that automatically collect dynamically generated webpages from the deep web.

Design/methodology/approach: This study proposes and develops an algorithm that manages script commands as links, so that the web crawler collects web information as if it were gathering static webpages. The proposed web crawler tests the algorithm experimentally by collecting deep webpages.

Findings: Among the findings of this study is that when search results are provided as script pages, the actual crawling process collects only the first page. However, the proposed algorithm can collect deep webpages in this case.

Research limitations/implications: To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors.

Practical implications: The research results show that the deep web is estimated to hold 450 to 550 times more information than surface webpages, and that its web documents are difficult to collect. However, this algorithm helps to enable deep web collection through script runs.

Originality/value: This study presents a new method that uses script links instead of the keywords adopted by previous approaches. The proposed algorithm is available as an ordinary URL. The conducted experiment shows that the scripts on individual websites must be analyzed before they can be employed as links.
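The idea of managing script commands as links can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal Python outline that assumes the crawl frontier stores both ordinary URLs and script commands as the same kind of entry. The names `Link`, `LinkExtractor`, `crawl`, `fetch`, and `run_script` are hypothetical, and `run_script` stands in for whatever script launcher is available (the paper uses the web browser object provided by Microsoft Visual Studio).

```python
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, Optional
from urllib.parse import urljoin


@dataclass(frozen=True)
class Link:
    """A frontier entry: an ordinary URL, or a script command found on a page."""
    page_url: str                 # URL to fetch, or page on which the script runs
    script: Optional[str] = None  # e.g. "goPage(2)"; None for an ordinary URL


class LinkExtractor(HTMLParser):
    """Collects <a> targets, turning script commands into Link entries."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        onclick = attributes.get("onclick")
        if onclick:
            # Assumption: an onclick handler is the navigation command.
            self.links.append(Link(self.base_url, onclick.strip()))
        elif href.lower().startswith("javascript:"):
            self.links.append(Link(self.base_url, href[len("javascript:"):].strip()))
        elif href:
            self.links.append(Link(urljoin(self.base_url, href)))


def crawl(
    seed: str,
    fetch: Callable[[str], Optional[str]],
    run_script: Callable[[str, str], Optional[str]],
    max_pages: int = 50,
) -> Dict[Link, str]:
    """Breadth-first crawl in which script links are queued like URLs."""
    start = Link(seed)
    frontier = deque([start])
    seen = {start}
    pages: Dict[Link, str] = {}
    while frontier and len(pages) < max_pages:
        link = frontier.popleft()
        if link.script is None:
            html = fetch(link.page_url)
        else:
            # The launcher loads the page, runs the command, returns the new HTML.
            html = run_script(link.page_url, link.script)
        if html is None:
            continue  # fetch failed or the script could not be launched
        pages[link] = html
        extractor = LinkExtractor(link.page_url)
        extractor.feed(html)
        for found in extractor.links:
            if found not in seen:
                seen.add(found)
                frontier.append(found)
    return pages
```

In this sketch a failed script launch simply yields no page, which mirrors the limitation stated above: pages behind scripts that the launcher cannot run are not collected. A real crawler would also need site-specific analysis to decide which script commands are worth queuing.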

Journal: Data Technologies and Applications
Publication Date: Mar 19, 2018
Citations: 5

