Hadoop vs. Spark: Impact on Performance of the Hammer Query Engine for Open Data Corpora

Mauro Pelucchi, Maurizio Toccu, Giuseppe Psaila

doi:10.3390/a11120209

Abstract

The Hammer prototype is a query engine for corpora of Open Data that provides users with the concept of blind querying. Since data sets published on Open Data portals are heterogeneous, users wishing to find interesting data sets are blind: queries cannot be fully specified, as they can be in the case of databases. Consequently, the query engine is responsible for rewriting and adapting the blind query to the actual data sets, by exploiting lexical and semantic similarity. The effectiveness of this approach was discussed in our previous works. In this paper, we report our experience in developing the query engine. In fact, in the very first version of the prototype, we realized that the implementation of the retrieval technique was too slow, even though corpora contained only a few thousand data sets. We decided to adopt the Map-Reduce paradigm, in order to parallelize the query engine and improve performance. We passed through several versions of the query engine, based either on the Hadoop framework or on the Spark framework. Hadoop and Spark are two very popular frameworks for writing and executing parallel algorithms based on the Map-Reduce paradigm. In this paper, we present our study of the impact of adopting the Map-Reduce approach and its two most famous frameworks to parallelize the Hammer query engine; we discuss various implementations of the query engine, obtained either without significantly rewriting the algorithm or by completely rewriting the algorithm to exploit the high-level abstractions provided by Spark. The experimental campaign we performed shows the benefits provided by each studied solution, with the perspective of moving toward Big Data in the future. The lessons we learned are collected and synthesized into behavioral guidelines for developers approaching the problem of parallelizing algorithms by means of Map-Reduce frameworks.

Highlights

In [1], we started a research project whose aim is to develop a technique for blind querying corpora of Open Data.
Open Data portals have become tools widely adopted by public administrations to diffuse data sets concerning territories and governments; since these data sets are publicly available to anybody, they are called “open”.
The idea of blind querying is motivated by the fact that a corpus of Open Data possibly contains thousands of data sets, each one with its own structure that is unknown to users wishing to look for interesting data sets.

Summary

Introduction

In [1], we started a research project whose aim is to develop a technique for blind querying corpora of Open Data. The idea of blind querying is motivated by the fact that a corpus of Open Data possibly contains thousands of data sets, each one with its own structure that is unknown to users wishing to look for interesting data sets.

Algorithms 2018, 11, 209

[...] feature-rich search engines and more or less give the same performance; Apache Solr is recommended for text-oriented search engines and Elasticsearch is better at handling analytical queries. Apart from the better recall and precision we obtained with our technique, the main difference is that our technique does not have to index (possibly huge) instances of data sets, but only meta-data: instances are downloaded only in the final phase of retrieval, at query time.
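To make the idea of matching a blind query against meta-data only more concrete, the following is a minimal sketch in Python, written in a Map-Reduce style. It is not the Hammer implementation: the catalog `CATALOG`, the functions `similarity`, `map_phase`, `reduce_phase` and `rank`, and the threshold value are all hypothetical names and values introduced for illustration, and the string ratio from Python's `difflib` is only a stand-in for the lexical similarity measure mentioned in the abstract. No instance of any data set is touched; only field names are compared with the query terms.

```python
from difflib import SequenceMatcher
from functools import reduce

# Hypothetical meta-data catalog: data set id -> field names (no instances).
CATALOG = {
    "ds1": ["city", "population", "year"],
    "ds2": ["municipality", "inhabitants"],
    "ds3": ["street", "zipcode"],
}


def similarity(a, b):
    """Lexical similarity in [0, 1]; difflib's ratio is a stand-in metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_phase(item, query_terms):
    """Map: for one data set, emit (id, best field similarity) per query term."""
    ds_id, fields = item
    return [(ds_id, max(similarity(t, f) for f in fields)) for t in query_terms]


def reduce_phase(acc, pair):
    """Reduce: sum the per-term scores emitted for the same data set id."""
    ds_id, score = pair
    acc[ds_id] = acc.get(ds_id, 0.0) + score
    return acc


def rank(query_terms, catalog=CATALOG, threshold=0.5):
    """Return (id, average score) pairs at or above threshold, best first."""
    pairs = [p for item in catalog.items() for p in map_phase(item, query_terms)]
    totals = reduce(reduce_phase, pairs, {})
    n = len(query_terms)
    ranked = sorted(((s / n, d) for d, s in totals.items()), reverse=True)
    return [(d, round(s, 3)) for s, d in ranked if s >= threshold]
```

For example, `rank(["city", "populaton"])` places `ds1` first even though the second term is misspelled, because the score depends on similarity rather than exact matching. In a framework such as Hadoop or Spark, the map and reduce steps would be distributed over the catalog; here they run sequentially so that the sketch stays self-contained.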

Conclusion