Understanding regional context of World Wide Web using common crawl corpus

Muhammad Amir Mehmood, Abdul Waheed, Hafiz Muhammad Shafiq

doi:10.1109/micc.2017.8311752

Abstract

The World Wide Web has become an essential tool for society. People rely heavily on the resources available on the web for communication, business, maps, and social networking. In addition, people seek web content in their preferred regional language besides English. The global statistics of the World Wide Web are well known; however, its regional context is poorly understood. This paper presents a large-scale web study using the Common Crawl corpus of December 2016. We examine 200+ terabytes of data with Amazon's Elastic MapReduce infrastructure. We analyze 2.87 billion web documents with respect to content type, domains, and content language. Furthermore, we explore multilingual web pages for European and Asian languages. Our results show that 97.8% of the web documents in our data are “text/html”. In addition, 57.2% of web documents contain content in the English language. Moreover, web content in the Russian language has a 5.7% share, which is more than any other European language. Furthermore, we found that 60.6% of web documents have content exclusively in the English language. Finally, we found that Japanese and traditional Chinese content dominate the Asian web pages with 1.89% and 1.23% shares, respectively. To the best of our knowledge, this is the first large-scale web study to explore the language mix present in web documents.
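The percentage figures reported in the abstract are shares of documents carrying a given label (content type or detected language). As a minimal sketch of that kind of tally, and not the authors' Elastic MapReduce pipeline, the Python snippet below counts labels and converts the counts to percentages. The function name `percentage_shares` and the toy sample are hypothetical and chosen only for illustration.

```python
from collections import Counter


def percentage_shares(labels):
    """Return {label: percent of documents} for an iterable of per-document labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # No documents: there are no shares to report.
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical toy sample of per-document content types, not data from the paper.
sample_content_types = ["text/html", "text/html", "text/html", "application/pdf"]
shares = percentage_shares(sample_content_types)
print(shares)
```

The same function could be applied to per-document language labels. At the scale described in the abstract, the counting step would presumably be distributed across workers and the partial counts summed before computing shares.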

Full Text