INTELLIGENT HIGH-PERFORMANCE CRAWLERS USED TO REVEAL TOPIC-SPECIFIC STRUCTURE OF THE WWW

András Lőrincz, István Kókai, Attila Meretei

doi:10.1142/s0129054102001230

Abstract

The slogan that "information is power" has undergone a slight change. Today, "information updating" is the focus of interest. The largest source of information today is the World Wide Web. Fast search methods are needed to utilize this enormous source of information. This paper describes our novel crawler, which uses support vector classification and on-line reinforcement learning. We launched crawler searches from different sites, including sites that offer, at best, very limited information about the search subject. This case may correspond to typical searches by non-experts. Results indicate that the considerable performance improvement of our crawler over other known crawlers is due to its on-line adaptation property. We used our crawler to characterize basic topic-specific properties of WWW environments. It was found that topic-specific regions have a broad distribution of valuable documents. Expert sites are excellent starting points, whereas mailing lists can form traps for the crawler. These properties of the WWW, together with the emergence of intelligent "high-performance" crawlers that monitor and search for novel information, predict a significant increase of communication load on the WWW in the near future.

Full Text