Abstract

The 21st-century economy is increasingly built around data. Firms and individuals upload and store enormous amounts of data. Most of the produced data is stored on private servers, but a considerable part is made publicly available across the 1.83 billion websites online. These data can be accessed by researchers using web-scraping techniques. Web scraping refers to the process of collecting data from web pages, either manually or using automation tools or specialized software. Web scraping is possible and relatively simple thanks to the regular structure of the code used for websites designed to be displayed in web browsers. Websites built with HTML can be scraped using standard text-mining tools, either with scripts in popular (statistical) programming languages such as Python, Stata, or R, or with stand-alone dedicated web-scraping tools. Some of these tools do not even require prior programming skills. Since about 2010, with the omnipresence of social and economic activities on the Internet, web scraping has become increasingly popular among academic researchers. In contrast to proprietary data, which may be out of reach due to substantial costs, web scraping can make interesting data sources accessible to everyone. Thanks to web scraping, data are now available in real time and in significantly greater detail than what has traditionally been offered by statistical offices or commercial data vendors. In fact, many statistical offices have started using web-scraped data, for example, for calculating price indices. Data collected through web scraping have been used in numerous economics and finance projects and can easily complement traditional data sources.
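The abstract notes that the regular structure of HTML makes pages scrapeable with standard text-mining tools. A minimal sketch of that idea in Python, using only the standard library's html.parser: the HTML snippet and its class names (a hypothetical product listing with "price" spans) are invented for illustration, and a real scraper would first fetch the page over HTTP, for example with urllib.request or the third-party requests library.

```python
from html.parser import HTMLParser

# Invented sample page: a real scraper would download this HTML
# from a website instead of embedding it in the script.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collects the text inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # The page's regular structure lets us target one tag/class pair.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data))
            self.in_price = False

parser = PriceParser()
parser.feed(SAMPLE_HTML)
print(parser.prices)  # [9.99, 24.5]
```

Because the markup repeats the same tag-and-class pattern for every item, the same handful of lines scales from two products to thousands, which is precisely what makes web-scraped price data attractive for constructing price indices.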
