Abstract

In many fields, recent years have brought a sharp rise in the size of the data to be analyzed and the complexity of the analysis to be performed. Such analyses are often described as dataflows specified in declarative dataflow languages. A key technique to achieve scalability for such analyses is the optimization of these declarative programs; however, many real-life dataflows are dominated by user-defined functions (UDFs) that perform, for instance, text analysis, graph traversal, classification, or clustering. This calls for specific optimization techniques, as the semantics of such UDFs are unknown to the optimizer. In this article, we survey techniques for optimizing dataflows with UDFs. We consider methods developed over decades of research in relational database systems as well as more recent approaches spurred by the popularity of Map/Reduce-style data processing frameworks. We present techniques for syntactical dataflow modification, approaches for inferring semantics and rewrite options for UDFs, and methods for dataflow transformations on both the logical and the physical levels. Furthermore, we give a comprehensive overview of declarative dataflow languages for Big Data processing systems from the perspective of their built-in optimization techniques. Finally, we highlight open research challenges with the intention of fostering more research into optimizing dataflows that contain UDFs.
