Estimating the Cardinality of Queries Based on a Sample from a Full Outer Join of Tables

U A Grigorev

doi:10.17587/it.29.650-663

Abstract

Cardinality estimation (CardEst) plays an important role in creating high-quality query execution plans in the DBMS. In the last decade, a large number of methods have been developed: traditional methods (histograms, samples), machine learning methods based on queries or data. But all of them are based on different restrictions and assumptions, and the cardinality estimation with their help worsens with an increase in the number of joined tables. The article proposes two new methods based on the theory of approximate calculation of aggregates and allowing to remove most of the restrictions. Method 1 fetches blocks after executing the subqueries of the original query and does not require a preliminary analysis of the filter conditions. The condition for joining tables can be arbitrary (not necessarily the equality of attributes). Method 2 allows you to calculate the probabilities of reading blocks based on the metadata accumulated in the process of populating the database. Metadata takes up little memory, and the overhead of maintaining it is low. Method 2, in contrast to method 1, takes into account the true cardinality of the connection of neighboring blocks in the sample obtained from metadata. Therefore, the prospect opens up for a more accurate assessment of the cardinality of joining a large number of tables. Method 1 (EVACAR method) is implemented and compared with modern machine learning methods BayesCard, DeepDB, FLAT on a special STATS test. The results of the experiments confirmed the effectiveness of the EVACAR method. The EVACAR method is more accurate or its maximum q-error is comparable to machine learning methods for 75-88 % of evaluated queries (subplans). In the future, it is planned to implement the 2nd method for assessing the cardinality of queries.

Full Text