Fast Construction of an Index Tree for Large Non-ordered Discrete Datasets Using Multi-way Top-Down Split and MapReduce

Zhichao Zhou,Xiaoqiang Liu,Yinglan Wang,Qiang Zhu

doi:10.1109/cbd.2016.019

Abstract

Effective indexing schemes are crucial in supporting efficient queries on large datasets from multidimensional Non-ordered Discrete Data Spaces (NDDS) in many applications such as genome sequence analysis in bioinformatics. Although constructing an index structure for a large dataset in an NDDS via a bulk loading technique is quite efficient (comparing to using a conventional tuple loading technique), existing bulk loading techniques cannot meet the scalability requirement for the fast growing sizes of datasets in contemporary NDDS applications. To tackle this challenge, we propose a new bulk loading method for fast construction of an index structure, called the PND-tree, for large datasets in NDDSs. Specifically, utilizing the characteristics of an NDDS and a priori knowledge of the given dataset, we suggest an effective multi-way top-down dataset split strategy with a MapReduce implementation for our bulk loading procedure. Experiments demonstrate that the proposed bulk loading method is quite promising in terms of the index construction efficiency and the resulting index quality, comparing to the conventional tuple loading method and a popular serial bulk loading method for a state-of-arts index tree in NDDSs.

Full Text