Abstract

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of the set containment search of Q over S. We first extend existing distinct value estimation techniques to this problem and develop IL-GKMV, an approach based on inverted lists and G-KMV sketches. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we turn to sampling and propose OT-Sampling, an approach based on an ordered trie structure. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than both simple random sampling and IL-GKMV. To further improve performance, we present DC-Sampling, a divide-and-conquer-based sampling approach that uses an inclusion/exclusion prefix to exploit pruning opportunities. We also consider weighted set containment selectivity estimation and devise a stratified random sampling approach named StrRS. We theoretically analyze the proposed techniques with respect to various accuracy estimators. Comprehensive experiments on nine real datasets verify the effectiveness and efficiency of the proposed techniques.
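As a rough illustration of the simple random sampling baseline that the abstract compares against, the sketch below draws a uniform sample of records without replacement, counts how many are contained in the query, and scales the count up to the dataset size. The function name `srs_estimate`, its parameters, and the toy records are assumptions made for this example; they are not the paper's implementation, and none of the trie-based or divide-and-conquer techniques are shown.

```python
import random
from typing import FrozenSet, Hashable, List, Optional


def srs_estimate(query: FrozenSet[Hashable],
                 records: List[FrozenSet[Hashable]],
                 sample_size: int,
                 rng: Optional[random.Random] = None) -> float:
    """Estimate |{X in S : Q ⊇ X}| from a uniform sample without replacement."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    n = len(records)
    if n == 0:
        return 0.0
    rng = rng or random.Random()
    m = min(sample_size, n)            # cannot sample more records than exist
    sample = rng.sample(records, m)    # uniform, without replacement
    hits = sum(1 for x in sample if x <= query)  # x <= query tests x ⊆ query
    return hits * n / m                # scale the sample count to the full dataset


# Hypothetical toy dataset (not from the paper).
records = [
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b"}),
    frozenset({"a", "b", "d"}),
]
query = frozenset({"a", "b", "c"})

full_scan = srs_estimate(query, records, sample_size=len(records))
partial = srs_estimate(query, records, sample_size=2, rng=random.Random(7))
```

When the sample covers the whole dataset the estimate coincides with the exact count; with smaller samples it is unbiased but can have high variance, which is the kind of weakness that motivates more structured sampling schemes.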

Highlights

  • Set-valued attributes are ubiquitous and play an important role in modeling database systems in many applications such as information retrieval, data cleaning, machine learning and user recommendation

  • Consider a query record Q and a collection of records S, where each record consists of an identifier and a set of elements. A set containment search retrieves the records in S that are contained in Q, i.e., {X | X ∈ S ∧ Q ⊇ X}, where Q contains X (Q ⊇ X) if all the elements of X are in Q

  • We investigate the problem of selectivity estimation on set containment search
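The definition in the highlights above can be made concrete with a small sketch. The function name `containment_selectivity` and the toy records are illustrative assumptions rather than anything taken from the paper; the code computes the exact selectivity by a full scan, which is the quantity an estimator tries to approximate without touching every record.

```python
from typing import Dict, FrozenSet, Hashable


def containment_selectivity(query: FrozenSet[Hashable],
                            records: Dict[int, FrozenSet[Hashable]]) -> int:
    """Exact number of records X in S with Q ⊇ X, found by scanning all of S."""
    return sum(1 for elements in records.values() if elements <= query)


# Hypothetical toy dataset: record identifier -> set of elements.
S = {
    1: frozenset({"a", "b"}),
    2: frozenset({"a", "c"}),
    3: frozenset({"b"}),
    4: frozenset({"a", "b", "d"}),
}
Q = frozenset({"a", "b", "c"})

exact = containment_selectivity(Q, S)
matching_ids = sorted(rid for rid, elements in S.items() if elements <= Q)
```

Here record 4 is excluded because its element "d" does not appear in Q, while records 1, 2 and 3 are all contained in Q.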


Summary

Introduction

Set-valued attributes are ubiquitous and play an important role in modeling database systems in many applications such as information retrieval, data cleaning, machine learning and user recommendation. For example, the selectivity of a set containment search that uses a product description as the query over a user preference dataset estimates the total number of users who may be interested in the product, and could serve as a prediction of the product’s market potential. In another example, companies may post positions on an online job market website, where each position description contains a set of required skills. A job seeker may want a basic understanding of the job market by obtaining the total number of active job vacancies that they perfectly match (i.e., the skill set of the job seeker contains the required skills of the job).

Challenges
Contributions
Problem Definition
Weighted Set Containment Search
Estimation Measure
KMV Synopses
Random Sampling Approach
IL‐GKMV
Analysis
Our Approach
Trie Structure‐Based Stratified Sampling Approach
Time Complexity
Divide‐and‐Conquer‐Based Sampling Approach
Approximate Divide‐and‐Conquer Algorithm
Compare with OT‐Sampling
Stratified Random Sampling
Query‐Oriented Sampling
Experimental Evaluation
Experimental Setting
Overall Performance
Estimation Accuracy Evaluation
Computation Efficiency Evaluation
Searching Set‐Valued Data
Selectivity Estimation
Findings
Conclusion