Abstract

A web crawler (also known as a web spider) is a program or script that automatically collects data from websites according to certain rules. Like a spider crawling along threads of URLs across the internet, it downloads the web page pointed to by each URL and extracts and analyzes the page's contents. With a web crawler, the massive amount of data on a target website can be collected automatically and saved in structured files or a database, so crawler operators can obtain large volumes of potentially valuable data at very little cost in time and money. This article takes http://novel.tingroom.com/ as the target and describes in detail the general steps of using a Python-based crawler program to obtain massive data (novel content): first, analyze the structure of the target page; second, use the requests module to fetch the target page; third, use the parsel module to extract the valuable parts of the page; finally, save the information into structured files. The empirical study shows that a Python-based crawler program has significant practical value.
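As a rough illustration of the four steps above, the following minimal sketch fetches one page with requests, extracts a title and body paragraphs with parsel, and writes them to a text file. The CSS selectors (an h1 title and p tags under a div with class "content") and the output layout are assumptions made for illustration only; the real selectors would come from inspecting the target page's HTML in step one.

```python
# Minimal sketch of the four-step workflow described in the abstract.
# The selectors below are hypothetical; adapt them after inspecting
# the actual structure of the target page (step 1).
import requests
import parsel

url = "http://novel.tingroom.com/"           # target site named in the abstract
headers = {"User-Agent": "Mozilla/5.0"}      # browser-like header to avoid trivial blocking

# Step 2: fetch the target page with requests.
response = requests.get(url, headers=headers, timeout=10)
response.encoding = response.apparent_encoding  # let requests guess the page encoding
html = response.text

# Step 3: extract the valuable parts with parsel (illustrative selectors).
selector = parsel.Selector(text=html)
title = (selector.css("h1::text").get(default="untitled")).strip()
paragraphs = selector.css("div.content p::text").getall()

# Step 4: save the extracted information into a structured (plain-text) file.
with open(f"{title}.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n\n")
    f.write("\n".join(p.strip() for p in paragraphs))
```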
