PROBABILISTIC MODELS FOR FOCUSED WEB CRAWLING

Hongyu Liu, Evangelos Milios

doi:10.1111/j.1467-8640.2012.00411.x

Abstract

A focused crawler is an efficient tool used to traverse the Web to gather documents on a specific topic. It can be used to build domain‐specific Web search portals and online personalized search tools. Focused crawlers can only use information obtained from previously crawled pages to estimate the relevance of a newly seen URL. Therefore, good performance depends on powerful modeling of context as well as the quality of the current observations. To address this challenge, we propose capturing sequential patterns along paths leading to targets based on probabilistic models. We model the process of crawling by a walk along an underlying chain of hidden states, defined by hop distance from target pages, from which the actual topics of the documents are observed. When a new document is seen, prediction amounts to estimating the distance of this document from a target. Within this framework, we propose two probabilistic models for focused crawling: the Maximum Entropy Markov Model (MEMM) and the Linear‐chain Conditional Random Field (CRF). With MEMM, we exploit multiple overlapping features, such as anchor text, to represent useful context and form a chain of local classifier models. With CRF, a form of undirected graphical model, we focus on obtaining globally optimal solutions along the sequences by taking advantage not only of text content, but also of linkage relations. We conclude with an experimental validation and comparison with focused crawling based on Best‐First Search (BFS), Hidden Markov Model (HMM), and Context‐graph Search (CGS).
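
To make the idea of hidden states defined by hop distance more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It uses a simplified HMM-style forward-filtering step (not the MEMM or CRF proposed in the paper) to estimate how far a newly seen page is from a target, and ranks candidate URLs by expected distance. The state set, the transition and emission tables, and the function names `forward_step`, `expected_distance`, and `rank_urls` are all hypothetical and chosen only for illustration.

```python
import heapq

# Hypothetical hidden states: hop distance from a target page (2 means "2 or more").
STATES = [0, 1, 2]

# Assumed toy values. TRANSITION[i][j]: probability that a page at distance i
# links to a page at distance j.
TRANSITION = [
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
    [0.1, 0.3, 0.6],
]

# Assumed toy values. EMISSION[j][o]: probability of observing topic cluster o
# on a page whose true distance is j.
EMISSION = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.2, 0.7],
]


def forward_step(belief, observation):
    """Propagate the belief through the transitions, weight by the emission
    probability of the observed topic cluster, then normalize."""
    unnormalized = [
        EMISSION[j][observation] * sum(belief[i] * TRANSITION[i][j] for i in STATES)
        for j in STATES
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def expected_distance(belief):
    """Expected hop distance to a target under the given belief."""
    return sum(d * p for d, p in zip(STATES, belief))


def rank_urls(candidates, parent_belief):
    """candidates: list of (url, observed_topic_cluster) pairs.
    Returns the URLs ordered from closest to farthest expected distance."""
    heap = []
    for url, observation in candidates:
        belief = forward_step(parent_belief, observation)
        heapq.heappush(heap, (expected_distance(belief), url))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

In a crawler, `rank_urls` would stand in for the frontier ordering: pages whose observed topic suggests a small distance to a target are visited first. A discriminative model such as MEMM or CRF would replace the generative tables with feature-based scores over the same hidden-state chain.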


Journal: Computational Intelligence
Publication Date: May 24, 2012
Citations: 19

