Abstract

Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express, for example, that sorting books by publication date in ascending order also sorts them by age in descending order. Knowledge of such order relationships is useful for many data management tasks, such as query optimization, data cleaning, and consistency checking. Because the bODs of a specific dataset are usually not given explicitly, they need to be discovered. However, discovering all minimal bODs (in set-based canonical form) has exponential complexity in the number of attributes, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded and distributed state-of-the-art bOD discovery algorithms by up to orders of magnitude; in particular, it can process much larger datasets.

Highlights

  • Order is a fundamental concept in relational data because every attribute can be used to sort the records of a relation

  • A bidirectional order dependency (bOD), such as [A ↑, B ↓] → [C ↑], lets us define the order direction of the individual attributes involved in the bOD; in this example, sorting by A in ascending order with ties resolved by B in descending order also sorts by C in ascending order

  • ODs are closely related to functional dependencies (FDs), which have been extensively studied in research [16], but due to their consideration of order, ODs subsume FDs [28]

Summary

Introduction

Order is a fundamental concept in relational data because every attribute can be used to sort the records of a relation. Some sortings represent the natural ordering of attribute values by their domain. Because a relational instance can follow only one sorting at a time, dependencies between different orders help to find optimal sortings; they reveal meaningful correlations between attribute domains. [...] the attribute values selected by X and Y, respectively. This means that ties in the order implied by the first attribute in the list are resolved by the second attribute in the list (and so forth). A bidirectional order dependency (bOD), such as [A ↑, B ↓] → [C ↑], lets us define the order direction of the individual attributes involved in the bOD; in this example, sorting by A in ascending order with ties resolved by B in descending order also sorts by C in ascending order. In other words, when we sort the tuples by the ADelay attribute, they are also ordered by the ADGrp attribute. [...] The value in attribute ADGrp of tuple t5 is greater than or equal to the value in ADGrp of tuple t9, but t5's value in ADelay is smaller than t9's value in ADelay.
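To make the semantics of a bOD such as [A ↑, B ↓] → [C ↑] concrete, the following is a minimal sketch of a naive validity check. It is not DISTOD and not any published discovery algorithm; it only tests one given candidate by comparing all tuple pairs. The names `sort_key` and `bod_holds`, the dictionary-based table layout, and the sample rows are assumptions made for illustration, and the sketch assumes numeric attribute values so that a descending direction can be expressed by negation.

```python
from itertools import permutations


def sort_key(row, spec):
    """Project a row onto a direction-aware key.

    spec is a list of (attribute, direction) pairs. Values are assumed
    to be numeric, so "desc" can be modelled by negating the value.
    Tuples compare lexicographically, so ties on the first attribute
    are resolved by the second attribute, and so forth.
    """
    return tuple(row[a] if d == "asc" else -row[a] for a, d in spec)


def bod_holds(rows, lhs, rhs):
    """Naive O(n^2) check of the candidate bOD lhs -> rhs.

    The candidate holds if, for every ordered pair of tuples (s, t),
    s <= t under the lhs ordering implies s <= t under the rhs ordering.
    """
    return all(
        sort_key(s, rhs) <= sort_key(t, rhs)
        for s, t in permutations(rows, 2)
        if sort_key(s, lhs) <= sort_key(t, lhs)
    )


# Hypothetical book table: publication year and age in years.
books = [
    {"year": 1999, "age": 25},
    {"year": 2010, "age": 14},
    {"year": 2010, "age": 14},
    {"year": 2020, "age": 4},
]

# Hypothetical table for the candidate [A asc, B desc] -> [C asc].
abc = [
    {"A": 1, "B": 9, "C": 10},
    {"A": 1, "B": 5, "C": 20},
    {"A": 2, "B": 7, "C": 30},
]

year_sorts_age_desc = bod_holds(books, [("year", "asc")], [("age", "desc")])
year_sorts_age_asc = bod_holds(books, [("year", "asc")], [("age", "asc")])
ab_sorts_c = bod_holds(abc, [("A", "asc"), ("B", "desc")], [("C", "asc")])
```

On the sample rows, the first and third candidates hold and the second is violated, because an older publication year comes with a larger age. The pairwise comparison keeps the definition visible but is quadratic in the number of tuples, which is one reason why checking the exponentially many candidates needs the pruning and distribution discussed in the abstract.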

