Effective and Efficient Lexicographical Order Dependency Discovery

Jixuan Chen, Yihan Li, Zijing Tan, Shuai Ma, Yifeng Jin, Weidong Yang

doi:10.1109/tkde.2023.3248780

Abstract

Lexicographical order dependencies state relationships of order between lists of attributes. They naturally model the order-by clauses in SQL queries, and have proven useful in query optimizations concerning sorting. Despite their importance, order dependencies on a dataset are typically unknown and are too costly, if not impossible, to design or discover manually. Techniques for automatic order dependency discovery have recently been studied. It is challenging for order dependency discovery to scale well, since it is by nature factorial in the number $m$ of attributes and quadratic in the number $n$ of tuples. In this paper, we adopt a strategy that decouples the impact of $m$ from that of $n$, and that still finds all minimal and valid lexicographical order dependencies. We present carefully designed data structures, a host of algorithms and optimizations, and an enhanced strategy combined with multithreaded parallelism, for an efficient implementation. Using a host of real-life and synthetic datasets, we experimentally verify that our approach is up to orders of magnitude faster than the state-of-the-art methods, and can deliver better results with an improved definition of minimal attribute lists.
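
To make the notion concrete, the following is a minimal sketch of a naive validity check for a single lexicographical order dependency, assuming the commonly used definition that X orders Y when every pair of tuples ordered by the list X is also ordered by the list Y. It is not the discovery algorithm of this paper; the function names `lex_leq` and `od_holds` and the sample `rows` are hypothetical and chosen only for illustration. The pairwise loop shows where the quadratic cost in $n$ comes from.

```python
from itertools import permutations

def lex_leq(s, t, attrs):
    """True if row s precedes or equals row t in lexicographic order on attrs."""
    return [s[a] for a in attrs] <= [t[a] for a in attrs]

def od_holds(rows, lhs, rhs):
    """Naive check of the order dependency lhs -> rhs over all ordered row pairs.

    Assumed definition: for all rows s, t, if s <= t on lhs then s <= t on rhs.
    This compares O(n^2) pairs and is meant only as an illustration.
    """
    return all(
        lex_leq(s, t, rhs)
        for s, t in permutations(rows, 2)
        if lex_leq(s, t, lhs)
    )

# Hypothetical sample data: two rows share a tax value but differ in salary.
rows = [
    {"salary": 10, "tax": 1},
    {"salary": 20, "tax": 1},
    {"salary": 30, "tax": 3},
]

salary_orders_tax = od_holds(rows, ["salary"], ["tax"])
tax_orders_salary = od_holds(rows, ["tax"], ["salary"])
```

On this sample, sorting by salary also sorts by tax, so the first check succeeds. The second fails because the two rows with equal tax are not forced into salary order. Repeating such a check for every candidate pair of attribute lists is what makes the problem factorial in $m$.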

Full Text