Abstract

The Web has become the main source of information in the digital world, continuously growing and expanding into heterogeneous domains. Through a search engine, a domain-unaware tool that maintains up-to-date information, users can systematically search the Web for particular information on the basis of a text query. One type of web search tool is the semantic focused web crawler (SFWC); it exploits the semantics of the Web, typically through ontology-based heuristics, to determine which web pages belong to the domain defined by a query. An SFWC is highly dependent on its ontological resource, which must be created by human domain experts. This work presents a novel SFWC based on a generic knowledge representation schema to model the crawler’s domain, thus reducing the complexity and cost of building a more formal representation, as is the case when ontologies are used. Furthermore, a similarity measure combining the inverse document frequency (IDF) metric, the arithmetic mean, and the standard deviation is proposed for the SFWC. This measure filters web page content according to the domain of interest during the crawling task. A set of experiments was run over the computer science, politics, and diabetes domains to validate and evaluate the proposed crawler. The quantitative (harvest ratio) and qualitative (Fleiss’ kappa) evaluations demonstrate the suitability of the proposed SFWC for crawling the Web using a knowledge representation schema instead of a domain ontology.
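The abstract describes the IDF-based filter only at a high level. The minimal sketch below illustrates one plausible reading of it: a crawled page is kept when the mean IDF of its in-vocabulary terms falls within one standard deviation of the corpus-wide mean IDF. The function names and the acceptance band are assumptions for illustration, not the paper’s exact formulation.

```python
import math
from collections import Counter
from statistics import mean, stdev

def idf_table(corpus_docs):
    """Inverse document frequency for every term in a reference corpus
    (a list of token lists built from the domain, e.g. a Wikipedia category)."""
    n_docs = len(corpus_docs)
    df = Counter()
    for doc in corpus_docs:
        df.update(set(doc))            # count each term once per document
    return {t: math.log(n_docs / df[t]) for t in df}

def page_in_domain(page_tokens, idf, lo, hi):
    """Accept a crawled page when the mean IDF of its in-vocabulary terms
    falls inside the [lo, hi] band derived from the domain corpus."""
    scores = [idf[t] for t in page_tokens if t in idf]
    if not scores:
        return False
    return lo <= mean(scores) <= hi

# Assumed acceptance band: corpus-wide mean IDF +/- one standard deviation.
# idf = idf_table(corpus_docs)
# vals = list(idf.values())
# lo, hi = mean(vals) - stdev(vals), mean(vals) + stdev(vals)
```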

Introduction

According to the website Live Stats [1], there are more than one billion active websites on the World Wide Web (WWW). The need for fast and reliable tools to effectively search and retrieve web pages from a particular domain has therefore been growing in importance. One of the most popular tools for systematically collecting web pages from the WWW is the web crawler. A web crawler is a system that traverses the Web by indexing Uniform Resource Locators (URLs). URL indexing allows web search engines and similar applications to retrieve resources from the Web [2]. The crawler searches for any URL reachable from the web page currently being retrieved by the search engine. Each URL found by the crawler is placed in a search queue to be accessed later by the search engine. The process repeats for each new URL retrieved from the queue. The stop criterion for URL searching varies; the most common is to stop after a threshold number of URLs has been retrieved from a seed or after a given depth level has been reached.
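As a rough illustration of the traversal just described, the following sketch implements a generic breadth-first crawler with a URL queue and the two common stop criteria (a page-count threshold and a depth limit). It is a minimal, domain-unaware example, not the SFWC itself, and all names are illustrative.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags in a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100, max_depth=3):
    """Breadth-first crawl: pop a URL, fetch it, enqueue its out-links.
    Stops after max_pages pages have been fetched or the depth limit is hit."""
    queue = deque((url, 0) for url in seed_urls)
    seen, fetched = set(seed_urls), []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except OSError:
            continue                      # skip unreachable pages
        fetched.append(url)
        if depth < max_depth:
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:  # avoid revisiting URLs
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
    return fetched
```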
