Articulating the construction of a web scraper for massive data extraction

Shreya Upadhyay, Vishal Pant, Shivansh Bhasin, Mahantesh K Pattanshetti

doi:10.1109/icecct.2017.8117827

Abstract

Massive volumes of data are generated by users, entities, and applications and disseminated online. This big data is distributed across millions of websites and is available to a wide range of applications. Search engines provide a simple mechanism for accessing it, but they require a user to spend time and resources manually clicking through and downloading results. Such a manual approach does not scale for the vast majority of real-life applications at the enterprise and organization level. A number of automated approaches to web data extraction exist, but most are ad hoc and domain-specific. There is therefore a need for a robust, automated, easy-to-use framework that extracts content from the web across domains with minimal human effort. The web scraper architecture proposed by the authors addresses this gap. The proposed framework offers an easy and feasible approach to parsing and extracting data at large scale from multiple websites with minimal human intervention. This paper provides insight into issues relevant to constructing a web scraper and concludes by describing the implementation of a web scraper that harvests learning objects for an eLearning application.
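
As a rough illustration of the kind of parsing and extraction step the abstract refers to, the sketch below collects the links found in an HTML page using only the Python standard library. It is not the authors' architecture, which the abstract does not detail. The names LinkExtractor, extract_links, and SAMPLE_PAGE, and the example.com URLs, are hypothetical and chosen only for illustration. The page is an in-memory string, so no network access is involved.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors that have no href or an empty one.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute link targets found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page standing in for a site that hosts learning objects.
SAMPLE_PAGE = """
<html><body>
  <a href="/lessons/intro.pdf">Intro</a>
  <a href="https://example.org/video.mp4">Video</a>
  <a>No target</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE, "https://example.com/course/")
for link in links:
    print(link)
```

A scraper working across many sites would repeat this step per page and add fetching, politeness controls, and per-domain extraction rules, none of which are shown here.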
