Proximal Policy Optimization-based Join Order Optimization with Spark SQL

Kyeong-Min Lee, Ina Kim, Kyu-Chul Lee

doi:10.5573/ieiespc.2021.10.3.227

Abstract

In a smart grid, massive amounts of data are generated during the production, transmission, and consumption of electricity. Complex and varied queries with multiple join and selection operations often need to be run on such data. Several studies have sought to improve query evaluation performance by applying machine learning techniques to query optimization problems. However, these studies are limited to processing queries for data in a single environment. In this paper, we propose a Proximal Policy Optimization (PPO)-based join order optimization model for use on Spark SQL to improve retrieval performance for large amounts of data. The model is trained with the cost computation method of Spark SQL, using the costs of the join plans generated by the model as rewards. Because Spark SQL is limited to a small search space, the model can find more join plans with lower costs than the plans that Spark SQL finds. We demonstrate that the proposed model generates join plans with costs similar to or lower than those of Spark SQL, without executing the optimization algorithm of Spark SQL.
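To make the idea of using plan cost as a reward concrete, the following is a minimal sketch and not the authors' implementation. The table names, cardinalities, selectivities, and the cost formula (sum of estimated intermediate result sizes for a left-deep plan) are all hypothetical stand-ins; they are not the cost computation of Spark SQL. The search here is plain exhaustive enumeration over a tiny example, whereas the paper's model would learn a PPO policy that maximizes the same kind of reward.

```python
from itertools import permutations

# Hypothetical smart-grid tables with assumed row counts (illustrative only).
CARD = {"meter": 1_000_000, "customer": 10_000, "region": 100}

# Assumed selectivities for pairs that share a join predicate.
# Pairs not listed have no predicate and are treated as cross products.
SEL = {
    frozenset({"meter", "customer"}): 1e-4,
    frozenset({"customer", "region"}): 1e-2,
}


def plan_cost(order):
    """Toy cost: sum of estimated intermediate sizes of a left-deep join order."""
    joined = [order[0]]
    rows = float(CARD[order[0]])
    cost = 0.0
    for table in order[1:]:
        # Combine the selectivities of all predicates linking the new table
        # to tables that are already part of the plan.
        sel = 1.0
        for t in joined:
            sel *= SEL.get(frozenset({t, table}), 1.0)
        rows = rows * CARD[table] * sel
        cost += rows
        joined.append(table)
    return cost


def reward(order):
    """Lower cost means higher reward, so the reward is the negated cost."""
    return -plan_cost(order)


# Exhaustive search is feasible for three tables; a learned policy would
# replace this step when the number of tables makes enumeration impractical.
best = max(permutations(CARD), key=reward)
print(best, plan_cost(best))
```

In this toy setting, orders that join the small tables first avoid the large cross product between `meter` and `region`, so they receive a higher reward.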

Full Text