Containerized execution of UDFs

Karla Saur,Tara Mirmira,Jesús Camacho-Rodríguez,Konstantinos Karanasos

doi:10.14778/3551793.3551860

Abstract

User-defined functions (UDFs) have long been used as the de facto way to extend the capabilities of data management systems. However, they are restricted to the specificities of each DBMS, and recent demands for advanced analytics have increased the need for complex UDFs that may require execution of arbitrary computation written in any programming language, management of library dependencies, portability across environments and engines, and resource isolation. These requirements go beyond what traditional UDFs were designed for, and have given rise to containerized UDFs that enable encapsulation and portability. However, this approach is nascent and can result in significant performance penalties and usability issues. In this paper, we present the first study that spans all stages of containerized UDFs' life cycle, performance bottlenecks in their execution, and extensibility to support different engines. Our experiments show that the performance of containerized UDF execution can be greatly affected by system design choices and that there are many trade-offs to consider. For example, regarding the method of communication with the containerized UDF, we show that binary-based implementations minimize overheads and are more than 2.4x faster than widely used text-based ones. Adopting a newer general-purpose communication method such as Arrow Flight can improve performance dramatically, causing a minimal ~10% slowdown compared to non-containerized UDFs. Additionally, containerized UDF start times vary wildly due to program size and complexity, from .07s to 7s in our experiments. Our insights can help DBMS developers make appropriate choices based on individual use cases when designing their systems.

Full Text