Web Scraping Using Summarization and Named Entity Recognition (NER)

Bhavya Bhardwaj, J Jaiharie, M Ganesan, Syed Ishtiyaq Ahmed, R Sorabh Dadhich

doi:10.1109/icaccs51430.2021.9441888

Abstract

“In the age of information, ignorance is a choice.” This line from Donny Miller captures the relevance and importance of information in the present digital era. With the rise and spread of the internet, the growth of social media usage among the youth demographic, and the large volumes of data generated by businesses and industries, the amount of information produced daily across platforms and across all demographics of modern society has risen exponentially. This rise has made the development of information processing and analysis techniques imperative. The spread of the internet across the globe has made it the largest repository of information and data. Internet companies, stock companies, market analyzers, and various other businesses use sophisticated tools and techniques to extract information from the internet. One of the most important and prevalent methods of extracting relevant data from the World Wide Web is web scraping. Web scraping has gained significant popularity because it makes it easy to extract information from target webpages and present it in a structured format with no manual intervention. While the traditional approach to web scraping offers significant advantages, it also requires prior knowledge of the DOM structure of the target webpages. The subsequent sections of this paper present a method that allows developers to bypass this requirement and that fully automates the process of crawling target URLs and scraping relevant information from them. Specifically, Natural Language Processing (NLP) and Machine Learning (ML) alternatives to the traditional web-scraping approach are presented.
To demonstrate the advantages offered by the improved algorithm, an epidemic predictor mapping the spread of a variety of infectious/viral diseases and their impact across the globe is built using the alternative methods provided in the publication.
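To make the contrast with DOM-dependent scraping concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a page's visible text can be collected without any page-specific selectors and then reduced with a simple word-frequency extractive summary, which stands in here for the summarization step. The names `VisibleTextParser`, `extract_text`, `summarize`, and the small `STOPWORDS` set are hypothetical and chosen only for illustration. The NER step is omitted because it would require a trained model.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "in", "of", "and", "to", "is", "are", "on", "for"}


class VisibleTextParser(HTMLParser):
    """Collect visible text from any page without page-specific selectors."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def summarize(text, max_sentences=2):
    """Naive extractive summary: keep the highest-scoring sentences, in order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)[:max_sentences]
    return [sentences[i] for i in sorted(ranked)]
```

Because `extract_text` never refers to tag names, classes, or ids of the content it keeps, the same call works on pages with different layouts. In a fuller pipeline, the summarized sentences would then be passed to an NER model to pull out entities such as disease names and locations.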

Full Text