The efficiency of query processing in the Spark SQL big data processing engine is significantly affected by execution plans and allocated resources. However, existing cost models for Spark SQL rely on hand-crafted rules. While learning-based cost models have been proposed for relational databases, they do not consider available resources. To address this issue, we propose a resource-aware deep learning model that automatically predicts query plan execution times from historical data. To train our model, we embed query execution plans within a query plan tree and extract features from the allocated resources. An adaptive attention mechanism is integrated into the deep learning model to enhance prediction accuracy. Additionally, we extract sufficient features to represent the data and learn its effect on query execution, which reduces the need to retrain the model when the data changes. The experimental results demonstrate that our deep cost model outperforms traditional rule-based methods and relational database learning-based optimizers in predicting query plan execution times.
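To make the general idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual architecture) of how per-node embeddings of a query plan tree might be combined with a resource feature vector through a simple attention mechanism; all names, dimensions, and the bilinear scoring function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(node_embs, resource_vec, W):
    # Hypothetical attention: score each plan-node embedding against
    # the resource context via a bilinear form, then softmax-pool.
    scores = node_embs @ W @ resource_vec          # (n_nodes,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over plan nodes
    return weights @ node_embs                     # (d_plan,) pooled vector

d_plan, d_res = 8, 4
node_embs = rng.normal(size=(5, d_plan))   # 5 operators in the plan tree
resource_vec = rng.normal(size=d_res)      # e.g. cores, memory, executors
W = rng.normal(size=(d_plan, d_res))       # learned interaction matrix

pooled = attention_pool(node_embs, resource_vec, W)
# A regression head would then map [pooled; resource_vec] to a runtime.
features = np.concatenate([pooled, resource_vec])
print(features.shape)  # (12,)
```

The point of the attention step is that the resource context reweights which plan operators dominate the prediction, so the same plan can yield different estimates under different allocations.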