Abstract
The World Wide Web is one of the most widely used information resources, and understanding it better enables us to benefit more from it. In this thesis we develop techniques to learn properties of web pages, such as language and topic, using only their URLs. We also compare and evaluate web page sampling algorithms to learn about properties of the web such as content length, top-level domain, and outdegree distribution.

In the first part of this thesis, we develop high-performance classifiers for web page language classification that use only the URL of a page. We conduct a comprehensive study of features and algorithms and evaluate our classifiers on several real data sets. For language classification, the quality of our URL-based classifiers rivals that of content-based classifiers. Language classification from the URL is useful when the content of a page is not available or when classification speed matters; URL-based language classifiers can be used by crawlers of general and language-specific search engines to avoid wasting bandwidth.

In the second part, we investigate whether web page topic classification can be performed using the URL alone. We explore this problem along several dimensions, experimenting with different algorithms, features, data sets, and topics. URL-based topic classification is useful when the content of a page is not available or is hidden in images; URL-based topic classifiers can be used to filter information and in applications such as topic-focused crawlers. Although content-based topic classifiers perform better, our URL-based topic classifiers work reasonably well and can serve as an additional signal to improve the performance of content-based classifiers.

In the third part, we compare state-of-the-art web page sampling algorithms and analyze the samples they return with respect to web properties such as content length, top-level domain, and outdegree distribution. We discuss the strengths and weaknesses of each algorithm and propose improvements based on experimental results. Because the sampling algorithms we run on the web are influenced by its structure, we also investigate the relationship between the properties of the web and its structure. Since it is not feasible to download all web pages in order to determine the properties of the web, a uniform random sample of the web would be quite useful for learning about its composition and development.
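To make the URL-only classification idea concrete, the following is a minimal sketch of how such a classifier could be set up. The thesis studies a much wider range of features and algorithms; here we simply assume character n-gram features over the raw URL string and a logistic regression model, with a small set of hypothetical example URLs, purely for illustration.

# Minimal sketch of URL-only language classification, assuming character
# n-gram features and logistic regression (the thesis evaluates many more
# features and algorithms; this is only an illustration).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data: (URL, language label) pairs.
urls = [
    "http://www.zeitung.de/politik/nachrichten",
    "http://www.journal.fr/actualites/economie",
    "http://www.example.com/news/world/politics",
    "http://www.haber.com.tr/ekonomi/son-dakika",
]
labels = ["de", "fr", "en", "tr"]

# Character n-grams of the URL capture language-specific tokens
# (e.g. "nachrichten", "actualites") without fetching any page content.
model = Pipeline([
    ("features", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(urls, labels)

# Predict the language of an unseen page from its URL alone.
print(model.predict(["http://www.wirtschaft-heute.de/artikel/42"]))

The same pipeline shape applies to URL-based topic classification by replacing the language labels with topic labels; only the training data and label set change, not the feature extraction.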