Web Mining and Search Engines

Parteek Bhatia

doi:10.1017/9781108635592.012

Abstract

Chapter Objectives

✓ To understand what is meant by web mining and its types
✓ To understand the working of the HITS algorithm
✓ To know the brief history of search engines
✓ To understand a search engine's architecture and its working
✓ To understand the PageRank algorithm and its working
✓ To understand the concepts of precision and recall

Introduction

Since Berners-Lee (inventor of the World Wide Web) created the first web page in 1991, the number of websites worldwide has grown exponentially. As of 2018, there were 1.8 billion websites in the world. This growth has been accompanied by an exponential increase in the amount of data available and by the need to organize this data in order to extract useful information from it. Early attempts to organize such data included the creation of web directories to group together similar web pages. The web pages in these directories were often manually reviewed and tagged based on keywords. Over time, search engines became available which employed a variety of techniques to extract the required information from web pages. These techniques are called web mining. Formally, web mining is the application of data mining and machine learning techniques to find useful information in the data present in web pages.

Web mining is divided into three parts: web content mining, web structure mining, and web usage mining, as shown in Figure 11.1. We will discuss each type of web mining in brief.

Web Content Mining

Web content mining deals with extracting relevant knowledge from the contents of a web page. During content mining, we totally ignore how other web pages link to a given web page or how users interact with it. A trivial approach to web content mining is based on the location and frequency of keywords. But this gives rise to two problems: first, the problem of scarcity and second, the problem of abundance.
The problem of scarcity occurs with queries that generate few or no results. The problem of abundance occurs with queries that generate too many search results. The root cause of both problems is the nature of the data present on the web. The data is usually in the form of HTML, which is semi-structured, and useful information is generally scattered across multiple web pages.
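
The keyword-based approach described above can be illustrated with a minimal sketch. It is not the method of any particular search engine: the three-page corpus, the URLs, the `TITLE_WEIGHT` value, and the function names `tokenize`, `score`, and `search` are all assumptions made for illustration. A keyword found in the title (location) is weighted more heavily than one found in the body, and repeated occurrences (frequency) raise the score.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; URLs and text are invented for illustration only.
PAGES = {
    "example.org/a": {
        "title": "Data mining basics",
        "body": "Data mining finds patterns in large data sets.",
    },
    "example.org/b": {
        "title": "Web data",
        "body": "Web pages hold data in semi-structured HTML.",
    },
    "example.org/c": {
        "title": "Search engines",
        "body": "A search engine indexes data from web pages.",
    },
}

# Assumed weight: a keyword in the title counts three times as much
# as the same keyword in the body.
TITLE_WEIGHT = 3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, keywords):
    """Score a page by keyword frequency, weighted by location."""
    title_counts = Counter(tokenize(page["title"]))
    body_counts = Counter(tokenize(page["body"]))
    return sum(TITLE_WEIGHT * title_counts[k] + body_counts[k] for k in keywords)


def search(query, pages=PAGES):
    """Return (url, score) pairs with a non-zero score, best match first."""
    keywords = tokenize(query)
    scored = ((url, score(page, keywords)) for url, page in pages.items())
    hits = [(url, s) for url, s in scored if s > 0]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


print(search("data"))       # every page in the corpus matches
print(search("hyperlink"))  # no page matches
```

Even on this toy corpus both problems appear. The query "data" matches every page, which at web scale becomes the problem of abundance, while "hyperlink" matches nothing, which is the problem of scarcity, although the corpus clearly contains related material about web pages.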

Full Text