Abstract

In many fields, collecting topic-relevant Web resources is crucial. As a vertical search method, the focused crawler has received considerable attention in recent years. Currently, most focused crawlers consider multiple evaluation factors for hyperlinks and use a weighted sum to compute the priorities of unvisited hyperlinks. However, suitable weighting coefficients are hard to determine, and unsuitable values may cause the crawler to drift seriously off topic. To overcome this issue, this article builds a multi-objective optimization model based on Web text and link structure and designs a crawler framework called Web space evolution (WSE), in which a hyperlink bank with a gradually increasing radius is introduced to extend the search scope of crawlers in Web space. To improve the uniformity and diversity of hyperlinks, a nearest and farthest candidate solution method is combined with fast non-dominated sorting to choose Pareto-optimal solutions (hyperlinks). A domain ontology based on formal concept analysis is used to establish the topic model. By incorporating the WSE and the domain ontology into focused crawling, a novel focused crawler called FCWSEO is proposed to collect topic-relevant webpages. Experimental results on the rainstorm disaster domain show that FCWSEO outperforms other focused crawler strategies in terms of the quantity and quality of retrieved relevant webpages.
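
To illustrate why the choice of weights matters, the following minimal Python sketch contrasts a weighted-sum ranking with a Pareto (non-dominated) filter over two objectives per hyperlink. It is not the FCWSEO implementation: the link names, the two scores (a text-relevance and a link-structure score), and the helper functions are hypothetical, and the filter is a plain pairwise dominance check rather than the fast non-dominated sorting combined with the nearest and farthest candidate solution method described in the abstract.

```python
from typing import Dict, List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the links that no other link dominates (pairwise check)."""
    return [
        name for name, s in scores.items()
        if not any(dominates(t, s)
                   for other, t in scores.items() if other != name)
    ]


def weighted_sum_best(scores: Dict[str, Tuple[float, ...]],
                      weights: Sequence[float]) -> str:
    """Return the link with the highest weighted-sum priority."""
    return max(scores,
               key=lambda n: sum(w * x for w, x in zip(weights, scores[n])))


# Hypothetical (text relevance, link-structure relevance) scores.
links = {
    "a": (0.9, 0.1),
    "b": (0.5, 0.6),
    "c": (0.1, 0.9),
    "d": (0.4, 0.4),  # dominated by "b" on both objectives
}

front = pareto_front(links)                       # needs no weights
best_text_heavy = weighted_sum_best(links, (0.9, 0.1))
best_link_heavy = weighted_sum_best(links, (0.1, 0.9))
```

In this toy data the weighted sum picks a different single link depending on the weights, whereas the non-dominated filter keeps every link that represents a distinct trade-off and discards only the dominated one.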

