Abstract

We propose an edge capacity based on hub and authority scores and examine its effect on the maximum-flow-based method for extracting Web communities proposed by G. Flake et al. (2000). A Web community is a collection of Web pages that share a common (or related) topic. In recent years, various methods for finding Web communities have been proposed. The method of G. Flake et al., which is based on a maximum flow algorithm, has a major advantage: topic drift does not easily occur. On the other hand, it assigns the same fixed capacity to every edge, which is one of the major causes of failure to obtain a proper Web community. Our approach, which uses an edge capacity based on HITS scores, effectively extracts Web pages that are well balanced in both their global and local relations to the given seed node. We evaluated the approach experimentally on 20 randomly selected topics, using Japanese Web archives crawled in 2002. The results confirmed that the average precision rose by approximately 20%.
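
As a rough illustration of the idea, the following Python sketch computes HITS hub and authority scores, derives a per-edge capacity from them, and takes the source side of a minimum cut as the community. It is not the construction used in the paper: the capacity formula hub(u) + authority(v), the virtual node names "S*" and "T*", the uniform sink capacity, and all function names are assumptions made for illustration only.

```python
from collections import deque


def hits(edges, iters=50):
    """Plain HITS power iteration; returns (hub, authority) score dicts."""
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]          # good hubs point to good authorities
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]          # good authorities are pointed to by good hubs
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


def hits_capacity(edges):
    """ASSUMPTION: capacity of (u, v) is hub(u) + authority(v).
    The abstract does not state the actual formula."""
    hub, auth = hits(edges)
    return {(u, v): hub[u] + auth[v] for u, v in edges}


def min_cut_source_side(cap, s, t):
    """Edmonds-Karp max flow; returns nodes reachable from s in the final residual graph."""
    res = {u: dict(vs) for u, vs in cap.items()}
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0.0)   # reverse residual edges
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res.get(u, {}).items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)         # no augmenting path left: this is the cut
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u


def extract_community(edges, seeds, capacity, sink_cap=1.0):
    """Simplified source/sink construction (an assumption, not the paper's exact graph):
    a virtual source feeds the seeds, every non-seed node drains to a virtual sink."""
    source, sink = "S*", "T*"          # hypothetical virtual node names
    cap = {source: {s: float("inf") for s in seeds}}
    for u, v in edges:
        cap.setdefault(u, {})[v] = capacity[(u, v)]
    for n in {n for e in edges for n in e} - set(seeds):
        cap.setdefault(n, {})[sink] = sink_cap
    return min_cut_source_side(cap, source, sink) - {source, sink}
```

A call such as `extract_community(edges, {"seed"}, hits_capacity(edges))` would use the score-based capacities, whereas passing a dictionary with the same value for every edge corresponds to the fixed-capacity setting. With fixed capacities every link is equally easy to cut; with score-based capacities, links between strong hubs and strong authorities are more expensive to cut, so those pages are more likely to stay on the seed's side.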
