Indexing for Large Scale Data Querying Based on Spark SQL

Yi Cui, Guoqiang Li, Daoyuan Wang, Hao Cheng

doi:10.1109/icebe.2017.25

Abstract

Spark SQL lets Spark programmers query structured data inside Spark programs using SQL statements. It makes the benefits of relational processing convenient to use, and its underlying distributed RDD processing accelerates queries on large data sets. However, Spark SQL is not designed for long-running services: its built-in data sources reload data from the storage system, such as HDFS or the local file system, on every table scan, without a caching mechanism. Although users can keep data in memory by explicitly issuing the cache command, the cached data is coarse-grained. In this paper, we present an indexing structure that is a pluggable component of Spark SQL, built on Apache Spark. Compared with plain Spark SQL, it offers two additional advantages. First, it allows users to create indexes on the structured data to be processed, which greatly speeds up queries. Second, it enables programmers to load structured data into memory in fine-grained file units, which makes it flexible to load hot data into memory and to evict cold data from memory.

Full Text