Selectivity Estimation on Set Containment Search

Yang Yang, Ying Zhang, Wenjie Zhang, Xuemin Lin, Liping Wang

doi:10.1007/s41019-019-00104-1


Open Access

https://doi.org/10.1007/s41019-019-00104-1


Abstract

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset $\mathcal{S}$, we aim to accurately and efficiently estimate the selectivity of set containment search of query Q over $\mathcal{S}$. We first extend existing distinct value estimation techniques to solve this problem and develop an inverted list and G-KMV sketch-based approach, IL-GKMV. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we develop effective and efficient sampling approaches and propose an ordered trie structure-based sampling approach named OT-Sampling. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than both the simple random sampling method and IL-GKMV. To further enhance performance, we present a divide-and-conquer-based sampling approach, DC-Sampling, which uses an inclusion/exclusion prefix to explore pruning opportunities. We also consider weighted set containment selectivity estimation and devise a stratified random sampling approach named StrRS. We theoretically analyze the proposed techniques with respect to various accuracy estimators. Our comprehensive experiments on nine real datasets verify the effectiveness and efficiency of the proposed techniques.

Highlights

Set-valued attributes are ubiquitous and play an important role in modeling database systems in many applications such as information retrieval, data cleaning, machine learning and user recommendation.
Given a query record Q and a collection of records S, where each record consists of an identifier and a set of elements, a set containment search retrieves the records from S that are contained by Q, i.e., {X | X ∈ S ∧ Q ⊇ X}, where Q contains X (Q ⊇ X) if all the elements in X are in Q.
We investigate the problem of selectivity estimation on set containment search.
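
The containment definition in the highlights can be illustrated with a minimal Python sketch. It computes the exact selectivity by scanning every record, which is the quantity the estimation techniques aim to approximate. The function name `containment_selectivity` and the toy records are assumptions made for illustration and do not come from the paper.

```python
from typing import Hashable, Iterable, Sequence


def containment_selectivity(
    query: Iterable[Hashable],
    records: Sequence[Iterable[Hashable]],
) -> int:
    """Count the records X in `records` with Q ⊇ X (exact, full scan)."""
    q = frozenset(query)
    # A record is contained by the query when every one of its
    # elements also appears in the query, i.e. X is a subset of Q.
    return sum(1 for x in records if frozenset(x) <= q)


# Hypothetical toy dataset, for illustration only.
records = [{"a", "b"}, {"a"}, {"c"}, {"a", "b", "d"}]
query = {"a", "b", "c"}
exact = containment_selectivity(query, records)
print(exact)
```

The full scan costs time linear in the total size of the dataset, which is the cost that sketch-based and sampling-based estimators try to avoid.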

Summary

Introduction

Set-valued attributes are ubiquitous and play an important role in modeling database systems in many applications such as information retrieval, data cleaning, machine learning and user recommendation. For example, a set containment search with a product description over a user preference dataset estimates the total number of users who may be interested in the product, which could serve as a prediction of the product's market potential. In another example, companies may post positions on an online job market website, where each position description contains a set of required skills. A job seeker may want a basic understanding of the job market by obtaining the total number of active job vacancies that he/she perfectly matches (i.e., the skill set of the job seeker contains the required skills of the job).
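
The job market example can be turned into a small sketch of a simple random sampling estimator, the baseline the abstract compares against. This is only an illustration under stated assumptions: the function name `estimate_selectivity`, its parameters, and the toy vacancy data are hypothetical, and the sketch uses plain uniform sampling without replacement rather than the paper's OT-Sampling or DC-Sampling.

```python
import random
from typing import Hashable, Iterable, Optional, Sequence


def estimate_selectivity(
    query: Iterable[Hashable],
    records: Sequence[Iterable[Hashable]],
    sample_size: int,
    seed: Optional[int] = None,
) -> float:
    """Estimate |{X in records : Q ⊇ X}| from a uniform random sample."""
    rng = random.Random(seed)
    q = frozenset(query)
    n = min(sample_size, len(records))
    if n <= 0:
        return 0.0
    # Draw n records uniformly without replacement.
    sample = rng.sample(list(records), n)
    hits = sum(1 for x in sample if frozenset(x) <= q)
    # Scale the sample hit count up to the full dataset size.
    return hits * len(records) / n


# Hypothetical vacancies (required skills) and a job seeker's skill set.
vacancies = [{"sql"}, {"sql", "python"}, {"java"}, {"python", "spark"}]
seeker = {"sql", "python", "spark"}
estimate = estimate_selectivity(seeker, vacancies, sample_size=2, seed=7)
print(estimate)
```

When `sample_size` is at least the number of records, the estimate coincides with the exact count. With smaller samples the estimate varies from run to run, which is the accuracy problem the more refined sampling approaches are designed to reduce.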

Challenges

Contributions

Problem Definition

Weighted Set Containment Search

Estimation Measure

KMV Synopses

Random Sampling Approach

IL‐GKMV

Analysis

Our Approach

Trie Structure‐Based Stratified Sampling Approach

Time Complexity

Divide‐and‐Conquer‐Based Sampling Approach

Approximate Divide‐and‐Conquer Algorithm

Compare with OT‐Sampling

Stratified Random Sampling

Query‐Oriented Sampling

Experimental Evaluation

Experimental Setting

Overall Performance


Estimation Accuracy Evaluation

Computation Efficiency Evaluation

Searching Set‐Valued Data

Selectivity Estimation

Findings

Conclusion


Journal: Data Science and Engineering	Publication Date: Sep 1, 2019
Citations: 6	License type: open-access
