Design and development of an automated web crawler used for building image databases

Y Kalmukov, I Valova

doi:10.23919/mipro.2019.8756790

Abstract

Every day, people worldwide upload millions of images to social networks, personal blogs, community forums and other web-based applications. To gain impact and public visibility, however, these images need to be indexed by search engines. Building an efficient non-textual search engine is far from trivial. It should employ modern information retrieval and image processing techniques to extract, index and store appropriate metadata from images, allowing fast subsequent processing and searching. Before any image processing can be applied, the search engine must be able to find and process the large amount of data constantly being added to the Internet. By nature, the WWW is an enormous directed, unweighted, cyclic graph. Blind crawling of such a structure is a waste of time and may never end. To be efficient, the crawler itself should be able to determine whether the current crawling direction is promising, i.e. likely to lead to the desired resources. Calculating weights for the graph components (vertices and edges) is therefore necessary for an automated crawling tool to navigate the web. This paper suggests various ways of calculating weights and proposes an architecture for a web crawler designed for building image databases. Choosing the most appropriate search strategy appears to be the key to building an efficient special-purpose web crawler.
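As a rough illustration of why weights matter when traversing a cyclic graph, the following minimal sketch runs a best-first traversal over a toy link graph. It is not the crawler proposed in the paper: the function `best_first_crawl`, the graph `toy_web` and the per-page values in `toy_scores` are hypothetical, and for simplicity only vertices carry weights here, whereas the abstract refers to weights on both vertices and edges.

```python
import heapq


def best_first_crawl(graph, scores, start, max_pages):
    """Visit pages in order of descending score, never visiting a page twice.

    graph:  dict mapping a page to the pages it links to (hypothetical toy web)
    scores: dict mapping a page to an estimated usefulness (hypothetical weights)
    """
    visited = []      # pages in the order they were crawled
    seen = {start}    # pages already queued; prevents looping on cycles
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-scores.get(start, 0.0), start)]
    while frontier and len(visited) < max_pages:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-scores.get(target, 0.0), target))
    return visited


# Toy directed graph with cycles (all names and numbers are made up).
toy_web = {
    "home": ["gallery", "about"],
    "gallery": ["album", "home"],   # the link back to "home" forms a cycle
    "about": ["home"],
    "album": ["gallery"],
}
toy_scores = {"home": 0.5, "gallery": 0.9, "about": 0.1, "album": 0.8}

order = best_first_crawl(toy_web, toy_scores, "home", max_pages=10)
print(order)
```

The `seen` set is what guarantees termination despite the cycles, and the priority queue is what turns the weights into a crawling direction. Both are generic choices made for this sketch, not details taken from the paper.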

Full Text