Abstract

One of the most popular algorithms for processing internet data, i.e. webpages, is the PageRank algorithm, which determines the importance of a webpage by assigning it a weight based on its incoming links. However, the sheer volume of internet data can make computing PageRank a significant computational burden. To address this burden, in this paper we present a PageRank algorithm for distributed systems built on the Hadoop MapReduce framework, called MR PageRank. Our algorithm decomposes into three processes, each implemented as one Map and Reduce job. We first parse the raw webpage input to produce each page title and its outgoing links as a key and value pair, respectively, along with the total weight of dangling nodes and the total number of pages. We next calculate the probability of each page and distribute this probability evenly across its outgoing links. The outgoing weights are shuffled and aggregated by page title to update each page's weight; this calculation accounts for dangling nodes and the jumping factor. Finally, all pages are sorted in descending order of their weights. Experimental results show that our implementation produces a reasonable ordering of pages.
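As a minimal sketch of the second job described above (the rank-update step), the mapper below distributes a page's current rank evenly over its outgoing links, and the reducer aggregates incoming contributions while applying the jumping factor. This is an illustrative assumption of how such a job is commonly written, not the paper's actual code; the class names (RankMapper, RankReducer), the constants DAMPING and NUM_PAGES, and the assumed input layout "title<TAB>rank<TAB>link1,link2,..." are all hypothetical.

```java
// Hypothetical sketch of one rank-update Map/Reduce job (not the paper's code).
// Assumed input line format: "title \t rank \t link1,link2,...".
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MRPageRankSketch {

    public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String title = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String[] links = parts.length > 2 ? parts[2].split(",") : new String[0];
            // Re-emit the adjacency list so the reducer can rebuild the graph.
            ctx.write(new Text(title),
                      new Text("LINKS\t" + (parts.length > 2 ? parts[2] : "")));
            // Distribute this page's rank evenly over its outgoing links.
            for (String link : links) {
                ctx.write(new Text(link), new Text(Double.toString(rank / links.length)));
            }
        }
    }

    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        private static final double DAMPING = 0.85;       // assumed jumping factor
        private static final long NUM_PAGES = 1_000_000L; // assumed total page count

        @Override
        protected void reduce(Text title, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            String links = "";
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("LINKS\t")) {
                    links = s.substring("LINKS\t".length());
                } else {
                    sum += Double.parseDouble(s); // incoming rank contribution
                }
            }
            // Standard PageRank update with a uniform jump term. A full
            // implementation, as the abstract notes, would also fold in the
            // shared dangling-node mass computed in the parsing job.
            double rank = (1.0 - DAMPING) / NUM_PAGES + DAMPING * sum;
            ctx.write(title, new Text(rank + "\t" + links));
        }
    }
}
```

In practice this job would be iterated until the ranks converge, with a driver wiring RankMapper and RankReducer together and a final job sorting pages by descending rank, as the abstract describes; the driver and sort job are omitted here for brevity.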
