SilkMoth

Dong Deng, Samuel Madden, Michael Stonebraker, Albert Kim

doi:10.14778/3115404.3115413

Open Access

https://doi.org/10.14778/3115404.3115413

Abstract

Determining whether two sets are related (that is, whether they have similar values or one set contains the other) is an important problem with many applications in data cleaning, data integration, and information retrieval. For example, set relatedness can help discover whether columns from two different databases are joinable; if enough of the values in the columns match, it may make sense to join them. A common metric measures the relatedness of two sets by treating their elements as vertices of a bipartite graph and calculating the score of the maximum matching between the elements. Compared to other metrics that require exact matches between elements, this metric uses a similarity function to compare elements between the two sets, making it robust to small dissimilarities in elements and more useful for real-world, dirty data. Unfortunately, the metric is computationally expensive, taking $O(n^3)$ time for each set-to-set comparison, where $n$ is the number of elements in the sets. Thus, for applications that search for all pairs of related sets in a brute-force manner, the runtime becomes unacceptably large. To address this challenge, we developed SilkMoth, a system capable of rapidly discovering related set pairs in collections of sets. Internally, SilkMoth creates a signature for each set, with the property that any other set that is related must match the signature. SilkMoth then uses these signatures to prune the search space, so that only sets matching the signatures are left as candidates. Finally, SilkMoth applies the maximum matching metric to the remaining candidates to verify which of them are truly related sets. An important property of SilkMoth is that, unlike approximate techniques, it is guaranteed to output exactly the same related set pairs as the brute-force method.
Thus, a contribution of this paper is the characterization of the space of signatures that enable this property. We show that selecting the optimal signature in this space is NP-complete and, based on insights from the characterization of the space, we propose two novel filters that help prune the candidates further before verification. In addition, we introduce a simple optimization to the calculation of the maximum matching metric itself, based on the triangle inequality. Compared to related approaches, SilkMoth is much more general, handling a larger space of similarity functions and relatedness metrics, and is an order of magnitude more efficient on real datasets.
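
To make the maximum matching metric and the brute-force baseline concrete, here is a minimal Python sketch. It is not SilkMoth's implementation: it omits signatures, filters, and the triangle-inequality optimization. It only illustrates the $O(n^3)$ verification step and the all-pairs search that the abstract describes as too slow. The element similarity (word-level Jaccard), the normalization in `relatedness`, and all function names are assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity; an assumed element similarity function."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def max_matching_score(sim: Sequence[Sequence[float]]) -> float:
    """Score of the maximum-weight bipartite matching for a similarity matrix.

    Runs the Hungarian algorithm (O(n^3)) on negated similarities.
    Assumes every similarity is non-negative.
    """
    if not sim or not sim[0]:
        return 0.0
    # The loop below needs rows <= columns, so transpose if necessary.
    if len(sim) > len(sim[0]):
        sim = [list(col) for col in zip(*sim)]
    n, m = len(sim), len(sim[0])
    inf = float("inf")
    u = [0.0] * (n + 1)    # row potentials
    v = [0.0] * (m + 1)    # column potentials
    match = [0] * (m + 1)  # match[j] = 1-based row matched to column j (0 = free)
    way = [0] * (m + 1)    # previous column on the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, m + 1):
                if used[j]:
                    continue
                cur = -sim[i0 - 1][j - 1] - u[i0] - v[j]
                if cur < minv[j]:
                    minv[j], way[j] = cur, j0
                if minv[j] < delta:
                    delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        # Flip the matching along the augmenting path back to column 0.
        while j0 != 0:
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    return sum(sim[match[j] - 1][j - 1] for j in range(1, m + 1) if match[j])


def relatedness(r: List[str], s: List[str],
                element_sim: Callable[[str, str], float] = jaccard) -> float:
    """Matching score normalized to [0, 1]; this normalization is an assumption."""
    if not r or not s:
        return 0.0
    sim = [[element_sim(x, y) for y in s] for x in r]
    score = max_matching_score(sim)
    return score / (len(r) + len(s) - score)


def related_pairs_brute_force(sets: List[List[str]], delta: float,
                              element_sim: Callable[[str, str], float] = jaccard
                              ) -> List[Tuple[int, int]]:
    """Compare every pair of sets; the baseline that signatures aim to avoid."""
    return [(i, j) for i, j in combinations(range(len(sets)), 2)
            if relatedness(sets[i], sets[j], element_sim) >= delta]
```

For instance, `related_pairs_brute_force(columns, 0.7)` would return the index pairs of columns whose normalized matching score reaches the hypothetical threshold 0.7. Because every pair triggers a full matching computation, the cost grows quadratically in the number of sets on top of the cubic per-pair cost, which is the motivation for pruning candidates before verification.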
