Multilingual Focused Crawler System based on Web Content Extraction and Path Configuration

Jie Wang, Lijuan Wang, Sanhong Deng

doi:10.1088/1757-899x/569/5/052030

Open Access

https://doi.org/10.1088/1757-899x/569/5/052030

Abstract

The multilingual focused crawler system combines web content extraction with path configuration, drawing on the advantages of both to collect network information automatically in multiple languages. First, the system selects foreign-language keywords according to the language of the webpages to be crawled and the Chinese keywords, and uses an initial link to obtain webpage information. Then, it obtains the webpage content using either path configuration information or a web content extraction algorithm based on line-block distribution, and applies rules or configuration information to acquire new links, the publication time, and the title. Next, the keywords are used to filter out irrelevant information. Finally, the results are presented as a list. Users may configure webpage path information or leave it unconfigured, according to their requirements, and the collected network resources can also be searched or filtered.
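The content extraction and keyword filtering steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it is a simplified variant of line-block text-density extraction, and the function names (strip_tags, extract_content, is_relevant) and the parameters block_size, threshold, and min_line are hypothetical choices made for illustration only.

```python
import re


def strip_tags(html: str) -> list[str]:
    """Remove scripts, styles, comments and tags; return stripped text lines."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    text = re.sub(r"(?s)<[^>]+>", "", html)
    return [line.strip() for line in text.split("\n")]


def extract_content(html: str, block_size: int = 3,
                    threshold: int = 40, min_line: int = 20) -> str:
    """Return the densest run of text lines (assumed to be the main content).

    Block i holds the total character count of lines i .. i+block_size-1.
    The longest-text run of consecutive blocks above `threshold` is kept,
    then short boundary lines (assumed to be navigation) are trimmed.
    """
    lines = strip_tags(html)
    n_blocks = max(len(lines) - block_size + 1, 1)
    block_len = [sum(len(l) for l in lines[i:i + block_size])
                 for i in range(n_blocks)]

    best_start, best_end, best_score = 0, 0, 0
    i = 0
    while i < n_blocks:
        if block_len[i] < threshold:
            i += 1
            continue
        j = i
        while j < n_blocks and block_len[j] >= threshold:
            j += 1
        # Blocks i .. j-1 cover lines i .. j-2+block_size.
        end = min(j - 1 + block_size, len(lines))
        score = sum(len(l) for l in lines[i:end])
        if score > best_score:
            best_start, best_end, best_score = i, end, score
        i = j

    # Trim short lines that the block window pulled in at either edge.
    while best_start < best_end and len(lines[best_start]) < min_line:
        best_start += 1
    while best_end > best_start and len(lines[best_end - 1]) < min_line:
        best_end -= 1
    return "\n".join(l for l in lines[best_start:best_end] if l)


def is_relevant(text: str, keywords: list[str]) -> bool:
    """Keyword filter: keep a page if any keyword occurs in its text."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)
```

In this sketch a page would be kept only when is_relevant(extract_content(html), keywords) is true. The path-configuration route mentioned in the abstract would replace extract_content with a lookup driven by user-supplied paths, which is not shown here.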
