Smart algorithmic based web crawling and scraping with template autoupdate capabilities

Fazal Qudus Khan, Naimat Ullah, Georgios Tsaramirsis, Mohamed Nazmudeen, Sadeeq Jan, Awais Ahmad

doi:10.1002/cpe.6042

Abstract

Web scraping is the process of extracting data from web pages, and it is an essential part of dataset generation. The field is currently dominated by capable commercial applications; however, there is always a need for web crawling and web scraping applications for custom projects. Developing fit-for-purpose tools for retrieving and structuring data from web services, cloud systems, and big data is a challenging task. Based on empirical studies, the challenges include structural issues, formatting/presentation, availability, denial of service, size, and information-fetching problems with browsers. Additionally, the data become inaccessible when the structure/template of the website changes, for example after a website update. The dataset then cannot be updated without manually modifying the parameters of the web scraper. In this paper we propose an algorithm capable of autocorrecting the template (the web scraping parameters) used for locating the target data, and of dealing with some common empirical problems. This is useful whenever the dataset needs to be updated later, since websites tend to change their pages. Moreover, we introduce an implementation of the algorithm in a tool developed for extracting data from the Unity Asset Store. The tool captures and stores data in XML format. It extracted a total of 46 785 items (40 611 3D and 6174 2D), with 35 successful first retries, 11 second retries, and 5 failures.
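As an illustration of the template autocorrection idea described above, the following is a minimal sketch and not the authors' algorithm. It assumes the template is a single CSS class name for a price field, that a stale template can be repaired by finding text that matches an expected pattern, and that pages are small well-formed HTML fragments without void elements. All names (`ClassTextCollector`, `extract`, `to_xml`, `price_class`, the sample pages) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect (class attribute, text) pairs for elements that carry a class.

    Assumption: every start tag has a matching end tag (no <br>, <img>, ...).
    """

    def __init__(self):
        super().__init__()
        self._stack = []  # class attribute of each open element, or None
        self.pairs = []   # (class_name, text) in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack and self._stack[-1]:
            self.pairs.append((self._stack[-1], text))


def extract(html, template, pattern):
    """Return (value, template); the template is corrected if it was stale."""
    collector = ClassTextCollector()
    collector.feed(html)
    collector.close()
    # First attempt: use the class name stored in the template.
    for cls, text in collector.pairs:
        if cls == template["price_class"]:
            return text, template
    # Retry: the template no longer matches, so look for text that fits the
    # expected pattern and adopt the class of the element that contains it.
    for cls, text in collector.pairs:
        if pattern.fullmatch(text):
            return text, {**template, "price_class": cls}
    # Failure: nothing matched; leave the template unchanged.
    return None, template


def to_xml(prices):
    """Serialize extracted values as a small XML document (hypothetical schema)."""
    root = ET.Element("items")
    for price in prices:
        item = ET.SubElement(root, "item")
        ET.SubElement(item, "price").text = price
    return ET.tostring(root, encoding="unicode")


PRICE = re.compile(r"\$\d+(\.\d{2})?")
old_page = '<div><span class="price">$9.99</span></div>'
new_page = ('<div><span class="title">Tree Pack</span>'
            '<span class="cost-label">$12.50</span></div>')

template = {"price_class": "price"}
first, template = extract(old_page, template, PRICE)
second, template = extract(new_page, template, PRICE)  # triggers the retry path
xml_out = to_xml([first, second])
```

After the second call the template holds the new class name, so later pages that use the updated layout are matched on the first attempt. A real scraper would need more robust locators than a single class name and a stricter check before accepting a corrected template.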

Full Text