An effective approach to enhancing a focused crawler using Google

Jae-Gil Lee, Sansung Kim, Donghwan Bae, Jungeun Kim, Mun Yong Yi

doi:10.1007/s11227-019-02787-9

Abstract

In this paper, we share our experience in augmenting the focused crawler of our vertical search engine, which is designed to work with academic slides. The goal of the focused crawler is to collect Microsoft PowerPoint files from academic institutions. A previous approach based on a general web crawler can fail to collect a sufficient number of files, mainly because of the robots exclusion protocol and missing hyperlinks. As a remedy, we propose a combined approach in which the indexing information maintained by a general web search engine such as Google is used to generate a list of target URLs through our query generator, which is then complemented by our URL extractor and file downloader. Because Google has already crawled billions of web pages, systematically retrieving the desired information from Google is more cost-efficient, and potentially more effective, than crawling from scratch ourselves. Our focused crawler, which we call SlideCrawler, has been used for our vertical search engine CourseShare since the fall of 2011. We verified the capability of SlideCrawler on the top 500 universities worldwide, from which it collected about one million files. Further, the study results show that SlideCrawler outperforms Nutch, collecting 3.7 times more slide files.
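
The abstract names three components (query generator, URL extractor, file downloader) without giving their details. The following is a minimal sketch of what the first two stages might look like, not the authors' implementation. It assumes queries are built from the `site:` and `filetype:` search operators, and that search results arrive as an HTML page whose links can be filtered by host and extension. The function names `generate_queries` and `extract_slide_urls` are hypothetical, and fetching the result pages and downloading the files are deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# File extensions treated as slide files (assumption: PowerPoint only).
SLIDE_EXTENSIONS = (".ppt", ".pptx")


def generate_queries(domain, keywords):
    """Build search query strings restricted to one domain and to slide file types.

    Hypothetical query format; the abstract does not state the real one.
    """
    queries = []
    for keyword in keywords:
        for ext in ("ppt", "pptx"):
            queries.append(f"site:{domain} filetype:{ext} {keyword}")
    return queries


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_slide_urls(result_html, domain):
    """Return unique slide-file URLs hosted on `domain` or one of its subdomains."""
    collector = _LinkCollector()
    collector.feed(result_html)
    seen = set()
    urls = []
    for link in collector.links:
        parsed = urlparse(link)
        host = parsed.hostname or ""
        in_domain = host == domain or host.endswith("." + domain)
        is_slide = parsed.path.lower().endswith(SLIDE_EXTENSIONS)
        if in_domain and is_slide and link not in seen:
            seen.add(link)
            urls.append(link)
    return urls
```

In this sketch, each university domain would yield a batch of queries, and the URLs extracted from the result pages would form the target list handed to a downloader. Filtering by host keeps links that point outside the institution out of that list.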

Journal: The Journal of Supercomputing
Publication Date: Feb 20, 2019
Citations: 6

Similar Papers

An Improved Technique for Web Page Classification in Respect of Domain Specific Search
Nidhi Saxena ... Vivek Chandra
International Journal of Computer Applications | VOL. 102
18 Sep 2014

Research and Implementation of a Vertical Search Engine in the Financial Domain
...
International Journal of u- and e-Service, Science and Technology | VOL. 7
31 Oct 2014

Improving educational web search for question-like queries through subject classification
Tolga Yilmaz ... Özgür Ulusoy
Information Processing & Management | VOL. 56
24 Oct 2018

A simple model of vertical search engines foreclosure
Emanuele Tarantino
Telecommunications Policy | VOL. 37
17 Dec 2012
