Abstract

The focused crawler based on semantic analysis is a research hotspot in the field of information retrieval. The domain ontology is generally applied to construct the topic model of the focused crawler. In order to overcome the limitations of builders' knowledge reserve and subjective consciousness in the process of constructing artificially ontology, a semi-automatic construction method of domain ontology based on ontology learning technology combining the latent Dirichlet allocation and the Apriori algorithm is proposed in this article. When evaluating the relevance between a hyperlink and a specific topic, the joint evaluation method considering both the web text and the link structure is usually used. However, the traditional weighted sum method is difficult to reasonably determine the optimal weights of these evaluating indicators. To solve this problem, a multi-objective optimization model for link evaluation and a subsequent multi-objective ant colony optimization algorithm (MOACO) are proposed. In the MOACO, a method of the nearest farthest candidate solution (NFCS) is combined with the fast non-dominated sorting to select a set of Pareto-optimal hyperlinks and guide the crawlers’ search directions. The experimental results of the focused crawling on the domain knowledge of typhoon disasters and rainstorm disasters prove that the ability of the proposed focused crawlers to retrieve topic-relevant webpages.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call