Abstract

The number of webpages on the Internet has increased tremendously over the last two decades; however, only a part of it is indexed by search engines. This small portion is the indexable web, which is usually reachable through a search engine. Search engines play a big role in making the World Wide Web accessible to the end user, and how much of the World Wide Web is accessible depends on the size of the search engine’s index. Researchers have proposed several ways to estimate the size of the indexable web using search engines, with and without privileged access to the search engine’s database. Our report summarizes the methods used in the last two decades to estimate the size of the World Wide Web and describes how this knowledge can be used in other tasks concerning the World Wide Web.

Highlights

  • The World Wide Web consists of millions of websites and billions of documents which are accessed through a search engine

  • Their lexicon consisted of 2,190,702 terms, and they ran the experiment with a total of 438,141 one-term queries. Their experiments did not consider disjunctive or conjunctive queries. They estimated the size of the indexable web to be more than 11.5 billion pages, obtained by summing the individual index sizes of the four search engines they considered (Google, MSN, Ask/Teoma and Yahoo!) after accounting for their overlap. Eight years after the first experiment, Altavista was no longer the most popular website; it was purchased by Yahoo! in 2003, and Yahoo! was in turn acquired by Verizon in 2017

  • The approaches described here do not require privileged access to a search engine’s database; their results, however, are influenced by many biases, with sampling bias persisting across all the different methods
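
The highlight on the 11.5 billion figure describes a sum of individual index sizes taken after accounting for overlap. The following is a minimal sketch of one way such a combination could be computed; it is not taken from the original study. The function name `union_size`, its parameters, and the numbers in the usage example are all hypothetical. The sketch assumes that, for each engine, we already have an estimate of the fraction of its pages not covered by the engines counted before it.

```python
def union_size(index_sizes, unseen_fractions):
    """Estimate the size of the union of several search engine indexes.

    index_sizes[i]      -- estimated number of pages in engine i's index
    unseen_fractions[i] -- estimated fraction of engine i's pages that are
                           NOT in any of engines 0..i-1 (1.0 for engine 0)

    Both inputs are assumptions of this sketch; how they are measured
    (for example by sampling query results) is outside its scope.
    """
    if len(index_sizes) != len(unseen_fractions):
        raise ValueError("one unseen fraction is needed per index size")

    total = 0.0
    for size, fraction in zip(index_sizes, unseen_fractions):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fractions must lie between 0 and 1")
        # Count only the pages this engine adds beyond the earlier ones,
        # so pages shared between engines are not counted twice.
        total += size * fraction
    return total


# Illustrative, made-up numbers (not measurements of any real engine):
# three engines with 100, 80 and 40 pages; half of the second engine's
# pages and a quarter of the third's are new.
print(union_size([100, 80, 40], [1.0, 0.5, 0.25]))  # 150.0
```

Adding only the previously unseen share of each index is equivalent to computing the size of a union by adding one set difference at a time, which avoids having to estimate every pairwise and higher-order intersection separately.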

Summary

Introduction

The World Wide Web consists of millions of websites and billions of documents, which are accessed through search engines. One of the immediate reasons why Google dominates the other search engines is its index size, that is, the number of documents it has indexed at a point in time. As reported in [1] and in more recent work [2], Google's index is bigger than those of all the other search engines combined, which gives it a tremendous competitive advantage: Google covers a lot more of the Web than the other search engines and therefore caters to a wider audience.

Multimodal Technologies and Interact. 2018, 2, 12; doi:10.3390/mti2020012 www.mdpi.com/journal/mti

On Webometrics
Study of Overlap
Graph Nature of the World Wide Web
Diameter of the Web Graph
Estimating the Size of the Indexed Web
Search Engines and WWW Size Estimation
Statistical Approach Using Web Page Sampling
Updated Experiment Setting
Size Estimation through Quadrat Sampling
Size Estimation through Extrapolation
Index Stability
Findings
Discussion and Conclusions
