Abstract
Modern search engines use the link structure of the World Wide Web to improve the ranking of results for users' queries. One of the most popular link-analysis ranking algorithms is HITS. It produces very accurate results, but because of the large amount of online computation it requires, it is relatively slow. In this paper we introduce PHITS, a parallelized version of the HITS algorithm that can process huge web graphs in a reasonable time. To implement it, we use the WebGraph framework and focus on parallelizing access to the web graph, which is the main bottleneck in the HITS algorithm.

I. INTRODUCTION

Search technology is one of the most important reasons for the success of the web. The huge amount of information available on the web, its high growth rate, and its unstructured nature all increase the need for search engines with high performance and accurate results. One of the major components of every search engine is its ranking algorithm. Traditional Information Retrieval (IR) systems usually use models such as the vector space model (VSM) (4) and rank results by the content similarity between the user's query and the retrieved documents. In the context of the web, however, these approaches have problems. For example, spamming may lead to ineffective ranking.

Several methods have been proposed to counter these problems, most of which use implicit information embedded in the web graph. These methods are known as link-analysis based algorithms. PageRank (5) and HITS (Hyperlink Induced Topic Search) (1) are the best-known algorithms in this category. PageRank, which is used by Google for ranking its results, is an offline, query-independent ranking algorithm: the ranking does not depend on users' specific queries and can therefore be computed once and reused for all upcoming queries. HITS, on the other hand, is an online, query-dependent algorithm.
Being query-dependent makes HITS more precise, but it also has disadvantages. The online computation this algorithm requires is substantial, and the resulting response time of the search engine after a user submits a query is not acceptable. To overcome this problem, in this paper we exploit parallel processing methods to improve the execution performance of the algorithm.

The rest of this paper is organized as follows. Section II discusses link-analysis based algorithms in general and HITS as a special case; the end of that section also describes some variations of and improvements to the HITS algorithm suggested in the literature. Sections III and IV discuss the implementation of the HITS algorithm and of its parallel version, PHITS, respectively. Finally, the last section presents conclusions and some ideas for future work on this topic.
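
To make the online cost of HITS concrete, the following is a minimal sequential sketch of the standard hub/authority iteration in Python. It is not the implementation described in this paper, which is built on the WebGraph framework. The function name `hits`, the adjacency-dictionary input format, and the `iterations` and `tol` parameters are assumptions chosen for illustration.

```python
import math


def hits(graph, iterations=50, tol=1e-9):
    """Sequential HITS sketch (illustrative; names and input format are assumptions).

    graph: dict mapping each node to a list of the nodes it links to.
    Returns (hub, auth): dicts of L2-normalized hub and authority scores.
    """
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Build the reverse adjacency (predecessor lists) once, up front.
    preds = {n: [] for n in nodes}
    for src, targets in graph.items():
        for dst in targets:
            preds[dst].append(src)

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {n: v / norm for n, v in scores.items()}

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking to n.
        new_auth = normalize({n: sum(hub[p] for p in preds[n]) for n in nodes})
        # Hub update: sum of authority scores of the pages n links to.
        new_hub = normalize(
            {n: sum(new_auth[s] for s in graph.get(n, ())) for n in nodes}
        )
        # Stop early once the scores no longer change noticeably.
        delta = sum(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n])
                    for n in nodes)
        hub, auth = new_hub, new_auth
        if delta < tol:
            break

    return hub, auth


if __name__ == "__main__":
    # Tiny example: pages "a" and "b" both link to page "c".
    hub, auth = hits({"a": ["c"], "b": ["c"]})
    print(sorted(auth.items()))
    print(sorted(hub.items()))
```

Every iteration reads the successor and predecessor lists of each node, so the running time is dominated by graph access. Because this work has to be repeated at query time, access to the web graph is the natural target for parallelization.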