Abstract

In the history of databases, eXtensible Markup Language (XML) has been regarded as the standard format for storing and exchanging semi-structured data. With the advent of IoT, XML technologies can play an important role in processing the massive amounts of data generated by heterogeneous devices. As the number and complexity of such datasets increase, there is a need for algorithms that can index and retrieve XML data efficiently, even for complex queries. In this context, twig pattern matching, that is, finding all occurrences of a twig pattern query (TPQ), is a core operation in XML query processing. Until now, holistic joins have been considered the state-of-the-art TPQ processing algorithms, but they fail to guarantee an optimal evaluation except at the expense of excessive storage costs, which limits their scope on large datasets. In this article, we introduce a new approach which significantly outperforms earlier methods in terms of both the size of the intermediate storage and query running time. The approach uses Child Prime Labels (Alsubai & North, 2018) to improve the filtering phase of bottom-up twig matching algorithms, together with a novel algorithm which avoids the use of stacks, thus improving TPQ processing efficiency. Several experiments were conducted on common benchmarks, namely the DBLP, XMark and TreeBank datasets, to study the performance of the new approach. Multiple analyses on a range of twig pattern queries are presented to demonstrate the statistical significance of the improvements.
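To make the notion of a twig pattern match concrete, the following is a minimal sketch of a naive matcher. It is not the approach described in the article: it uses plain recursion with no labels, indexes or filtering, and it reports only the elements that can serve as the query root in at least one match rather than the full match tuples. The names `Node`, `QNode`, `matches` and `twig_roots` are hypothetical, and the axis strings "/" (parent-child) and "//" (ancestor-descendant) follow XPath notation as an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An XML element reduced to its tag and child elements."""
    tag: str
    children: list = field(default_factory=list)


@dataclass
class QNode:
    """A twig query node; each edge is (axis, child), axis "/" or "//"."""
    tag: str
    edges: list = field(default_factory=list)


def descendants(node):
    """Yield every proper descendant of node in document (preorder) order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def matches(node, query):
    """True if the twig rooted at query can be embedded at node."""
    if node.tag != query.tag:
        return False
    for axis, query_child in query.edges:
        # "/" restricts candidates to children, "//" allows any descendant.
        candidates = node.children if axis == "/" else descendants(node)
        if not any(matches(c, query_child) for c in candidates):
            return False
    return True


def twig_roots(root, query):
    """Return all elements, in document order, where the twig matches."""
    return [n for n in [root, *descendants(root)] if matches(n, query)]


# Tiny bibliography-like document (made-up data for illustration).
doc = Node("dblp", [
    Node("article", [Node("author"), Node("title")]),
    Node("article", [Node("title")]),
    Node("book", [Node("chapter", [Node("author")])]),
])

# Twig: article with an author child and a title child.
q_article = QNode("article", [("/", QNode("author")), ("/", QNode("title"))])
print(len(twig_roots(doc, q_article)))
```

Every query edge triggers a fresh scan of children or descendants here, which is the kind of repeated work that labelling and filtering schemes are designed to avoid.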

Highlights

  • XML technology has emerged as the de facto standard for storing semi-structured data and for data exchange in e-business [19]

  • A set of novel bottom-up holistic twig matching algorithms is proposed, based on a new advanced preorder filtering function that, unlike previous filtering strategies such as [30] and [32], preserves the document order and filters out irrelevant elements when parent-child (P-C) relationships are involved in a Twig Pattern Query (TPQ)

  • We have presented new approaches that use Child Prime Label (CPL) indexing to improve the filtering phase of bottom-up twig matching algorithms


Summary

INTRODUCTION

XML technology has emerged as the de facto standard for storing semi-structured data and for data exchange in e-business [19]. The Child Prime Label (CPL) algorithm is an extension of the getNext() core function in the classical holistic twig join algorithm, TwigStack [10]. This new filtering function can filter out irrelevant elements efficiently without either violating the document order or consuming additional space. We present a set of novel bottom-up holistic twig matching algorithms based on a new advanced preorder filtering function that, unlike previous filtering strategies such as [30] and [32], preserves the document order and filters out irrelevant elements when parent-child (P-C) relationships are involved in TPQs. Full proofs of correctness are also provided for the algorithms needed to evaluate subsets of TPQs containing P-C and ancestor-descendant (A-D) axes.
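The summary does not define how a Child Prime Label is computed, so the following sketch rests on an assumption: each distinct tag is assigned a distinct prime, and an element's label is the product of the primes of its children's distinct tags. Under that assumption, a divisibility test tells whether an element has at least one child with a given tag, which is enough to discard elements that cannot satisfy the P-C edges of a query node. All function names are hypothetical, and this is not the getNext() extension itself.

```python
from itertools import count


def primes():
    """Yield 2, 3, 5, ... by trial division (adequate for a small tag set)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n


def assign_tag_primes(tags):
    """Map each distinct tag to its own prime, in sorted tag order."""
    gen = primes()
    return {tag: next(gen) for tag in sorted(set(tags))}


def child_label(child_tags, tag_prime):
    """Assumed label: product of the primes of the distinct child tags."""
    label = 1
    for tag in set(child_tags):
        label *= tag_prime[tag]
    return label


def may_match(label, required_child_tags, tag_prime):
    """Keep an element only if it has a child for every required P-C edge."""
    return all(label % tag_prime[tag] == 0 for tag in required_child_tags)


tag_prime = assign_tag_primes(["author", "title", "year"])

# An element with two author children and one title child.
label = child_label(["author", "title", "author"], tag_prime)

print(may_match(label, ["author", "title"], tag_prime))  # has both children
print(may_match(label, ["year"], tag_prime))             # no year child
```

The check needs only the element's own label and the query node's child tags, so it can run while elements are read in document order and needs no extra per-element structure beyond one integer.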

RELATED WORK
OPTIMAL TWIG JOINS
TwigPrime
EXPERIMENTAL EVALUATION
Findings
CONCLUSION