Abstract

Massive volumes of data are generated by users, entities, and applications and disseminated online. This big data is distributed across millions of websites and is available to a wide range of applications. Search engines provide a simple mechanism for accessing this data, but using them requires a user to spend time and resources manually clicking through and downloading results. Such a manual approach does not scale for the vast majority of real-life applications at the enterprise and organization level. A number of automated approaches to extracting data from the web exist, but most are ad hoc and domain specific. There is therefore a need for a robust, automated, easy-to-use framework that extracts content from the web across domains with minimal human effort. The web scraper architecture proposed in this paper addresses this gap. The proposed framework offers an easy and feasible approach to parsing and extracting data at large scale from multiple websites with minimal human intervention. This paper provides insight into the issues relevant to constructing a web scraper and concludes by describing the implementation of a web scraper that harvests learning objects for an eLearning application.
