Design of a vertical search engine for synchrotron data: a big data approach using Hadoop ecosystem

Ali Khaleghi, Kamran Mahmoudi, Sonia Mozaffari

doi:10.1007/s42452-019-1582-1

Abstract

A synchrotron, as an experimental physics facility, provides an opportunity for multi-disciplinary research and collaboration between scientists in fields such as physics and chemistry. During the construction and operation of such a facility, valuable data regarding the design of the facility, its instruments and the experiments conducted there are published and stored. Researchers spend a long time going through results from general-purpose search engines to find the scientific information they need, so a domain-specific search engine can help them find that information with greater precision. It also makes it possible to use the crawled data to create a knowledge base and to generate the different datasets researchers require. Several other vertical search engines have been designed for searching scientific data, such as medical information. In this paper we propose the design of such a search engine on top of the Apache Hadoop framework. Using the Hadoop ecosystem provides necessary features such as scalability, fault tolerance and availability. It also abstracts away the complexities of search engine design by using different open source tools as building blocks, among them Apache Nutch for the crawling block and Apache Solr for indexing and query processing. In our initial results, obtained by implementing the proposed method in single-node mode, an index of over a hundred thousand pages was created with an average fetch interval of 30 days, comprising 28 segments and approximately 570 MB in size. Performance factors such as the usage of available bandwidth and system load were logged using the Linux sysstat package.

Full Text