Abstract

The trie is one of the most common data structures for string storage and retrieval. As a fast and efficient implementation of the trie, the double array (DA) can effectively compress strings to reduce storage space. However, this method suffers from low index construction efficiency. To address this problem, we design a two-level partition (TLP) framework in this paper. We first divide the dataset into smaller lower-level partitions, and then we merge these partitions into bigger upper-level partitions using a min-heap based greedy merging algorithm (MH-GMerge). TLP balances load well and can be easily parallelized. We implemented two efficient parallel partitioned DAs based on TLP. Extensive experiments were carried out, and the results showed that the proposed methods can significantly improve the construction efficiency of DA and can achieve a better trade-off between construction and retrieval performance than the existing state-of-the-art methods.
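To make the merging step concrete, the following is a minimal sketch of a min-heap based greedy merge. It is not the paper's MH-GMerge, whose details are not given in the abstract. It assumes that each lower-level partition is described only by a size (for example, its number of strings), that the goal is to balance total size across a fixed number of upper-level partitions, and that partitions are considered largest first. The function name `greedy_merge` and its parameters are hypothetical.

```python
import heapq


def greedy_merge(partition_sizes, k):
    """Group lower-level partitions into k upper-level partitions.

    Illustrative assumption: each lower-level partition is represented
    only by its size, and balance means similar total size per group.
    Returns (loads, groups), where groups[j] lists the indices of the
    lower-level partitions assigned to upper-level partition j.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    # Heap entries are (current load, upper-partition id); the id breaks ties.
    heap = [(0, j) for j in range(k)]
    heapq.heapify(heap)
    groups = [[] for _ in range(k)]

    # Visit lower-level partitions from largest to smallest (assumed order).
    order = sorted(range(len(partition_sizes)),
                   key=lambda i: partition_sizes[i],
                   reverse=True)

    for i in order:
        # Pop the currently lightest upper-level partition and add to it.
        load, j = heapq.heappop(heap)
        groups[j].append(i)
        heapq.heappush(heap, (load + partition_sizes[i], j))

    # Read the final loads back out of the heap.
    loads = [0] * k
    for load, j in heap:
        loads[j] = load
    return loads, groups


if __name__ == "__main__":
    loads, groups = greedy_merge([7, 5, 4, 3, 1], k=2)
    print(loads, groups)
```

Because every assignment goes to the lightest group at that moment, the resulting upper-level partitions tend to have similar total sizes, and each one could then be handed to a separate worker for index construction.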

Highlights

  • String storage and retrieval are fundamental operations in many fields, such as search engines, natural language processing, and artificial intelligence applications

  • Extensive experiments show that our proposed indexes can significantly improve the construction efficiency of the double array (DA) and outperform some other state-of-the-art competitors in many aspects

  • There are two common partitioning strategies available for parallel string processing: Balanced Partition (BP) and Balanced Partition with Partition Line

  • Two characters may contend for a single position in DA, leading to position competition


Summary

Introduction

String storage and retrieval are fundamental operations in many fields, such as search engines, natural language processing, and artificial intelligence applications. Just as the B+-tree is the representative database index for integers [1], the trie is one of the most common structures for string storage and retrieval and is extensively used in artificial intelligence [2,3], natural language processing [4], data mining [5], IP address searching [6,7], string similarity joining [8,9], and many other fields. The linked form of the trie is space-efficient, but its retrieval is relatively slow. Basic trie implementations have difficulty balancing retrieval performance against storage overhead. Level-Ordered Unary Degree Sequence (LOUDS) [18,19] and double array (DA) [20] are the two most

