ScalaGiST

Peng Lu,Sai Wu,Hoang Tam Vo,Beng Chin Ooi,Gang Chen

doi:10.14778/2733085.2733087

Abstract

MapReduce has become the state-of-the-art for data parallel processing. Nevertheless, Hadoop, an open-source equivalent of MapReduce, has been noted to have sub-optimal performance in the database context since it is initially designed to operate on raw data without utilizing any type of indexes. To alleviate the problem, we present ScalaGiST - scalable generalized search tree that can be seamlessly integrated with Hadoop, together with a cost-based data access optimizer for efficient query processing at run-time. ScalaGiST provides extensibility in terms of data and query types, hence is able to support unconventional queries (e.g., multi-dimensional range and k -NN queries) in MapReduce systems, and can be dynamically deployed in large cluster environments for handling big users and data. We have built ScalaGiST and demonstrated that it can be easily instantiated to common B + -tree and R-tree indexes yet for dynamic distributed environments. Our extensive performance study shows that ScalaGiST can provide efficient write and read performance, elastic scaling property, as well as effective support for MapReduce execution of ad-hoc analytic queries. Performance comparisions with recent proposals of specialized distributed index structures, such as SpatialHadoop, Data Mapping, and RT-CAN further confirm its efficiency.

Full Text