Abstract

The Internet is the largest database of information ever built by mankind. It contains a wide variety of self-explanatory content available in varied formats such as audio, video, and text. However, the poorly structured data that fills much of the Internet is difficult to extract and hard to use in an automated process. Web scraping removes the manual work of extracting and organizing information and provides an easy-to-use way to collect data from webpages, convert it into a desired format, and store it in a local repository. Owing to the wide range of applications of Web scraping, from lead generation to reputation and brand monitoring, and from sentiment analysis to data augmentation in machine learning, many organizations use various tools to extract useful data. This study deals with different Web scraping tools and libraries, categorized into (i) partial tools, (ii) libraries and frameworks, and (iii) complete tools, that have been developed over the last few years and are extensively used to collect data and convert it into structured data for text-processing applications. This paper explores the terms Web scraping and Web crawling, categorizes the tools currently available on the market, and enables readers to build their own Web scraper using one such tool. The paper closes with comments on the legality of Web scraping.

Keywords: Web scraping, Legislation, Web data extraction, Scraping tool, DOM tree
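
As a rough illustration of the collect, convert, and store pipeline described above, the following minimal sketch uses only the Python standard library. It is not one of the tools surveyed in the paper. The sample HTML, the class names `product`, `name`, and `price`, and the helper names `ProductParser` and `scrape_to_csv` are all hypothetical, and the page is embedded as a string rather than downloaded so that the example runs offline.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
    <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
  </ul>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects the text of the name and price spans of each product item."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per product found so far
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            self.rows.append({})  # start a new record
        elif tag == "span" and cls in ("name", "price") and self.rows:
            self._field = cls

    def handle_data(self, data):
        if self._field is not None:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_to_csv(html: str) -> str:
    """Extract product records from HTML and return them as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(scrape_to_csv(SAMPLE_HTML))
```

In practice the returned CSV text would be written to a file or database, which corresponds to the local repository step. Dedicated libraries and frameworks usually replace the hand-written parser shown here.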
