Programming abstractions, compilation, and execution techniques for massively parallel data analysis

Stephan Ewen

doi:10.14279/depositonce-4395

Abstract

We are witnessing an explosion in the amount of available data. Today, businesses and scientific institutions have the opportunity to analyze empirical data at unprecedented scale. For many companies, the analysis of their accumulated data is nowadays a key strategic aspect. Today’s analysis programs consist not only of traditional relational-style queries, but increasingly use more complex data mining and machine learning algorithms to discover hidden patterns or build predictive models. However, with the increasing data volume and increasingly complex questions that people aim to answer, there is a need for new systems that scale to the data size and to the complexity of the queries. Relational Database Management Systems have been the workhorses of large-scale data analytics for decades. Their key enabling feature was arguably the declarative query language that brought physical schema independence and automatic optimization of queries. However, their fixed data model and closed set of possible operations have rendered them unsuitable for many advanced analytical tasks. This observation made way for a new breed of systems with generic abstractions for data-parallel programming, arguably the most famous of which is MapReduce. While bringing large-scale analytics to new applications, these systems still lack the ability to express complex data mining and machine learning algorithms efficiently, or they specialize in very specific domains and give up applicability to a wide range of other problems. Compared to relational databases, MapReduce and the other parallel programming systems sacrifice the declarative query abstraction and require programmers to implement low-level imperative programs and to manually optimize them. This thesis discusses techniques that realize several of the key aspects enabling the success of relational databases in the new context of data-parallel programming systems.
The techniques are instrumental in building a system for generic and expressive, yet concise, fluent, and declarative analytical programs. Specifically, we present three new methods: First, we provide a programming model that is generic and can deal with complex data models, but retains many declarative aspects of the relational algebra. Programs written against this abstraction can be automatically optimized with techniques similar to those used for relational queries. Second, we present an abstraction for iterative data-parallel algorithms. It supports incremental (delta-based) computations and transparently handles state. We give techniques to make the optimizer iteration-aware and deal with aspects such as loop-invariant data. The optimizer can produce execution plans that correspond to well-known hand-optimized versions of such programs. That way, the abstraction subsumes dedicated systems (such as Pregel) and offers competitive performance. Third, we present and discuss techniques to embed the programming abstraction into a functional language. The integration allows for the concise definition of programs and supports the creation of reusable components for libraries or domain-specific languages. We describe how to
