Abstract

The multilingual focused crawler system combines web content extraction with path configuration to make use of their advantages and achieve automatic collection of network information in multiple languages. Firstly, system selects foreign language keywords according to crawling webpage language and Chinese keywords, and uses initial link to obtain webpage information. Then, it uses path configuration information or web content extraction algorithm based on the distribution line block to get webpage content, and adopts rules or configuration information to acquire new links, published time and title. Next, keywords are used to filter irrelevant information. Finally, results are presented as a list. When users use focused crawler system, the webpage path information can be configured or not according to requirements, and the collected network resources can also be searched or filtered.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call