Abstract

In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language (SQL), providing convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still suffers from the inefficiency of Spark caused by the Java virtual machine and by unnecessary data serialization and deserialization. Adopting a native language such as C++ can help avoid these bottlenecks: benefiting from a bare-metal runtime environment and template usage, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the required programming and debugging effort. In this work, we present LotusSQL, an engine that provides SQL support for the dataset abstraction of the native backend Lotus. We employ a convenient SQL processing framework to handle frontend jobs, and we add advanced query optimization techniques to improve the quality of execution plans. On top of the storage design and user interface of the compute engine, LotusSQL implements a set of highly efficient structured dataset operations and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× on certain queries and outperforms SparkSQL on a standard query benchmark by more than 2× on average.

Highlights

  • The rapid development of information technology has brought significant progress to human society, and the amount of data that computer systems need to process has increased accordingly

  • SparkSQL provides a bridge between relational tables and Resilient Distributed Datasets (RDDs), and it can function as a distributed Structured Query Language (SQL) query engine, bringing significant convenience to end-users

  • To evaluate LotusSQL, we present two experiments on the standard relational benchmark Transaction Processing Performance Council (TPC)-H [19]

Summary

Introduction

The rapid development of information technology has brought significant progress to human society, and the amount of data that computer systems need to process has increased accordingly. Spark’s core programming abstraction is an immutable object collection called the Resilient Distributed Dataset (RDD). SparkSQL [7] is designed for processing structured data on Spark. It provides a bridge between relational tables and RDDs, and it can function as a distributed SQL query engine, bringing significant convenience to end-users. The Lotus storage module is designed for low overhead, based on a combination of buffer caches and compact object models. Lotus datasets provide the abstraction of compact collections together with efficient operation implementations. Apart from distributed allocation, this abstraction is quite similar to Spark’s RDD: it adopts a lazy evaluation strategy and supports fault tolerance. A Lotus dataset supports primary data types, including int, double, and string. String data are organized into two compact buffers: one holding indexes and the other holding the original characters.
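The two-buffer string layout described above can be illustrated with a minimal C++ sketch. This is an assumption-laden illustration, not the actual Lotus implementation: the class name `CompactStringColumn` and its members are hypothetical, and we assume the index buffer stores the end offset of each string inside a single contiguous character buffer.

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical sketch of a compact string column (not the Lotus source):
// all characters live back-to-back in one buffer, and a separate index
// buffer records the end offset of each string. String i then spans
// [offsets[i-1], offsets[i]) in the character buffer.
class CompactStringColumn {
public:
    void append(std::string_view s) {
        chars_.insert(chars_.end(), s.begin(), s.end());
        offsets_.push_back(static_cast<uint32_t>(chars_.size()));
    }

    // Zero-copy read: returns a view into the shared character buffer.
    std::string_view get(size_t i) const {
        uint32_t begin = (i == 0) ? 0 : offsets_[i - 1];
        return {chars_.data() + begin, offsets_[i] - begin};
    }

    size_t size() const { return offsets_.size(); }

private:
    std::vector<char> chars_;       // original characters, concatenated
    std::vector<uint32_t> offsets_; // index buffer: end offset per string
};
```

Compared with storing one heap-allocated object per string, this layout keeps the data cache-friendly and avoids per-element allocation and serialization overhead, which is consistent with the compact object models the text attributes to the Lotus storage module.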
