Abstract

Sampling is the most powerful tool for researchers to study important characteristics of the continuously growing Web. In the Web page sampling problem, we collect a set of pages that is representative of the Web population. However, we believe Web sampling differs greatly from the generic sampling problem. First, the randomness principle cannot be applied to Web sampling mechanically; second, randomness at the page level should not be the only goal of Web sampling. We believe that there is still room to improve on the randomness goal and that, beyond pursuing randomness at the page level, new objectives should be set at the host and domain levels.

In our work, we designed a new Web sampling method called Probability Proportional to the Size of Websites (PPSW) sampling. After preliminary experiments and analysis, we concluded that no earlier sampling method took the host and domain levels of the Web into account. We therefore sought new Web sampling methods that yield samples representative at the host and domain levels. With this new objective in mind, we redesigned the jumping strategy that the random walk uses while sampling. This preferential jumping strategy markedly increased the validity of the random walk at the host and domain levels. More specifically, random-walk-based sampling methods have two configurations: whether the random walk has a random jump probability, and whether the random walk is conducted on the undirected Web graph with the help of a search engine. By controlling these two configurations together with our newly designed preferential jumping strategy, we conducted four kinds of new sampling experiments. Among the four groups of experiments, the directed one with random jump showed a great performance improvement.

To evaluate our new PPSW sampling methods, we put forward new objectives along with corresponding formulas. The first two are coverage objectives. The number of domains is several orders of magnitude smaller than the number of Web pages, and data at that scale is usually manageable. Therefore, we want the sample to cover as many hosts and domains as possible. In addition to the two coverage objectives, which are crude, we also propose four proportion objectives. These four objectives tell us, from different angles, whether a sample reflects the sizes of hosts and domains: Domain Host Distribution, Domain Page Distribution, Host Page Distribution and Single Domain Page Distribution.

We conducted 150 comparison experiments on the three classical random-walk-based Web sampling methods and our PPSW sampling methods under the same environment, kept as realistic as possible. By observing the process and results, we discuss their performance in the following aspects:
• Conventional Evaluations: e.g., out-degree, in-degree and PageRank distributions, and "Bucket Standard Deviation".
• New Evaluations: by examining the two coverage and four proportion objectives, we found that, among all the sampling methods, our PPSW sampling methods have the best performance.
• Other Aspects: e.g., the length of the walk, the stability and efficiency of the sampling methods, the size of the starting page set and the influence of search engines.
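As a rough illustration of the idea of a directed random walk with a size-proportional jump, together with a crude host coverage measure, the following Python sketch may help. It is not the method evaluated in this work: the abstract does not give the jump rule or the formulas, so the jump probability of 0.15, the `host_sizes` estimates, the `entry_pages` mapping, the toy graph and the function names are all assumptions made for illustration.

```python
import random
from urllib.parse import urlparse


def host_of(url):
    """Return the host part of a URL, e.g. 'a.example' for 'http://a.example/1'."""
    return urlparse(url).netloc


def size_proportional_walk(links, host_sizes, entry_pages, start, steps,
                           jump_prob=0.15, seed=0):
    """Directed random walk with a size-proportional jump (illustrative only).

    links:       page URL -> list of outlinked page URLs
    host_sizes:  host -> assumed size estimate (number of pages)
    entry_pages: host -> page the walk lands on after a jump (assumption)
    """
    rng = random.Random(seed)
    hosts = sorted(host_sizes)
    weights = [host_sizes[h] for h in hosts]
    current = start
    visited = [current]
    for _ in range(steps):
        outlinks = links.get(current, [])
        # Jump at dead ends, or with probability jump_prob otherwise.
        if not outlinks or rng.random() < jump_prob:
            # Choose the target host with probability proportional to its size.
            host = rng.choices(hosts, weights=weights, k=1)[0]
            current = entry_pages[host]
        else:
            current = rng.choice(outlinks)
        visited.append(current)
    return visited


def host_coverage(sample, all_hosts):
    """Fraction of the known hosts that appear at least once in the sample."""
    known = set(all_hosts)
    seen = {host_of(url) for url in sample}
    return len(seen & known) / len(known)


# Toy graph with two hosts; every URL here is hypothetical.
toy_links = {
    "http://a.example/1": ["http://a.example/2", "http://b.example/1"],
    "http://a.example/2": ["http://a.example/1"],
    "http://a.example/3": [],
    "http://b.example/1": ["http://a.example/3"],
}
toy_sizes = {"a.example": 3, "b.example": 1}
toy_entries = {"a.example": "http://a.example/1", "b.example": "http://b.example/1"}

walk = size_proportional_walk(toy_links, toy_sizes, toy_entries,
                              start="http://a.example/1", steps=50)
print(host_coverage(walk, toy_sizes))
```

In this sketch the larger host receives jumps more often, which is one plausible reading of "probability proportional to the size of websites"; an actual crawler would have to estimate host sizes rather than receive them as input.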
