An important feature of modern query optimizers is the ability to produce a query plan that is optimal for the underlying data set. This requires the ability to estimate cardinalities and computational costs of intermediate query plan nodes, which is highly dependent on both the query shape and the underlying data distribution. Traditional methods include collecting statistics on base tables and implementing cardinality and computational cost derivation inside the optimizer, which is error-prone for complex query shapes. This paper presents Presto's novel history-based optimization framework (HBO), which collects execution histories and uses them to optimize similar queries in the future. The framework produces accurate estimates for complex query shapes in a lightweight, automated manner, and adapts automatically to changes in underlying data distributions. We present the design and implementation of the HBO framework and provide details on its use in various optimization rules, as well as details on implementing the statistics store on top of a Redis key-value store. We also present the results of running HBO in production in two large data infrastructure organizations (Meta and Uber).
Read full abstract