Spark SQL

Michael Armbrust, Xiangrui Meng, Ali Ghodsi, Reynold S Xin, Joseph K Bradley, Michael J Franklin, Matei Zaharia, Tomer Kaftan, Yin Huai, Cheng Lian, Davies Liu

doi:10.1145/2723372.2742797

Abstract

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
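
The idea of composable optimizer rules mentioned in the abstract can be illustrated with a minimal sketch. This is not Catalyst's actual API (Catalyst is written in Scala and relies on pattern matching over tree nodes); the sketch below uses plain Python, and every name in it (`Literal`, `Attribute`, `Add`, `fold_constants`, `drop_zero`, `transform`, `optimize`) is hypothetical and chosen only for illustration. It assumes a rule is a function that either returns a rewritten node or `None` when it does not apply, and that rules are run repeatedly until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


# Hypothetical expression nodes; illustrative only, not Catalyst's classes.
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
# Assumption: a rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def fold_constants(e: Expr) -> Optional[Expr]:
    """Rewrite Add(Literal(a), Literal(b)) into Literal(a + b)."""
    if isinstance(e, Add) and isinstance(e.left, Literal) and isinstance(e.right, Literal):
        return Literal(e.left.value + e.right.value)
    return None


def drop_zero(e: Expr) -> Optional[Expr]:
    """Rewrite x + 0 or 0 + x into x."""
    if isinstance(e, Add):
        if e.left == Literal(0):
            return e.right
        if e.right == Literal(0):
            return e.left
    return None


def transform(e: Expr, rules: List[Rule]) -> Expr:
    """Rewrite children first, then apply the first matching rule to the node."""
    if isinstance(e, Add):
        e = Add(transform(e.left, rules), transform(e.right, rules))
    for rule in rules:
        rewritten = rule(e)
        if rewritten is not None:
            return rewritten
    return e


def optimize(e: Expr, rules: List[Rule], max_iters: int = 100) -> Expr:
    """Run the rule set repeatedly until the tree no longer changes."""
    for _ in range(max_iters):
        nxt = transform(e, rules)
        if nxt == e:
            return e
        e = nxt
    return e


# x + (1 + 2): the constant subtree is folded, the attribute is left alone.
expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = optimize(expr, [fold_constants, drop_zero])
print(optimized)
```

Because each rule is a small independent function, adding an optimization in this sketch means appending one more function to the list passed to `optimize`, which is the kind of composability the abstract attributes to Catalyst.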

Similar Papers

Performance Evaluation of Spark SQL for Batch Processing
K Anusha ... K Usha Rani
01 Jan 2020

SHC: Distributed Query Processing for Non-Relational Data Store
Weiqing Yang ... Mingjie Tang
01 Apr 2018

Optimization in the catalyst optimizer of Spark SQL
Meenu Chawla ... Vinita Baniwal
Turkish Journal of Electrical Engineering & Computer Sciences, Vol. 26
28 Sep 2018

Indexing for Large Scale Data Querying Based on Spark SQL
Yi Cui ... Daoyuan Wang
01 Nov 2017
