Suffix Array Research Articles

Many real-world problems require a sorting operation as part of an efficient solution. Examples include real-time plasma diagnostics, image re-ranking, and suffix array construction. These problems usually involve large amounts of data, so their solutions need a particular form of sorting in which several arrays, stored as matrix rows or array segments, are sorted independently. This operation is called segmented sorting. Previous studies showed that a merge sort-based strategy and a strategy called fix sort perform this operation on GPUs with good performance for different array sizes. In this work, we compare the fastest segmented sorting GPU implementations on seven different GPU models under various input data scenarios, including varying numbers of segments, varying segment sizes, and segments of equal and of different sizes. We first perform an algorithm analysis to explain how the number of segments affects each implementation's performance. Then, we perform an S-curve analysis and observe that, even though each strategy may be the fastest option for a subset of the sorting scenarios, some approaches can cause very high slowdowns in specific scenarios. We also compare the strategies using heat maps, show that their performance depends on the array size and the number of segments, and propose a recommendation map to support selecting the best overall implementation based on the size and number of segments. Our experimental results show that choosing a strategy based on our recommendation map yields the best strategy in 47.57% of the cases and a maximum slowdown of less than 1.5 times in 93.58% of the cases.
Moreover, on average, the recommended strategy is only 1.11× slower than the optimal one. Finally, we evaluate how each strategy behaves when sorting arrays with equal and with different segment sizes. The fix sort-based approaches take roughly the same time in both cases, while the approach proposed by Hou et al. usually takes longer to sort arrays with different segment sizes than arrays with equal segment sizes.
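
To make the operation concrete, here is a minimal CPU sketch in Python of what segmented sorting computes. It is not any of the GPU implementations evaluated in the abstract; the function names and the `offsets` convention (start index of each segment) are assumptions made for illustration. The second function shows one generic way to reduce the problem to a single global sort by keying each element with its segment index.

```python
from bisect import bisect_right


def segmented_sort(values, offsets):
    """Sort each segment values[offsets[i]:offsets[i+1]] independently.

    `offsets` holds the start index of every segment (hypothetical convention);
    the last segment runs to the end of `values`.
    """
    out = list(values)
    bounds = list(offsets) + [len(values)]
    for start, end in zip(bounds, bounds[1:]):
        out[start:end] = sorted(out[start:end])
    return out


def segmented_sort_single_pass(values, offsets):
    """Same result using one global sort keyed by (segment index, value)."""
    # Segment index of position i = number of segment starts <= i, minus one.
    seg_ids = [bisect_right(offsets, i) - 1 for i in range(len(values))]
    keyed = sorted(zip(seg_ids, values))
    return [v for _, v in keyed]


data = [5, 1, 4, 9, 2, 8, 7, 6]
starts = [0, 3, 5]  # three segments: [5,1,4], [9,2], [8,7,6]
print(segmented_sort(data, starts))
print(segmented_sort_single_pass(data, starts))
```

Both calls return the array with every segment sorted in place and segment boundaries unchanged. The single-pass variant works for segments of equal or different sizes, because each element carries its segment index in the sort key.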

Given an input string, the Burrows-Wheeler Transform (BWT) can be seen as a reversible permutation of it that allows efficient compression and fast substring queries. Due to these properties, it has been widely applied in the analysis of genomic sequence data, enabling important tasks such as read alignment. Mantaci et al. [TCS2007] extended the notion of the BWT to a collection of strings by defining the extended Burrows-Wheeler Transform (eBWT). This definition requires no modification of the input collection, and its output is independent of the order of the strings in the collection. Over the years, however, the term eBWT has come to be used more generally for any BWT of a collection of strings, and the fundamental property of the original definition (i.e., independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for constructing the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We also combine our new eBWT construction with a variation of prefix-free parsing (PFP) [WABI 2019] to allow construction of the eBWT on large collections of genomic sequences. We implement this combined algorithm (pfpebwt) and evaluate it on a collection of human chromosome 19 sequences from the 1,000 Genomes Project, on a collection of Salmonella genomes from GenomeTrakr, and on a collection of SARS-CoV2 genomes from EBI's COVID-19 data portal. We demonstrate that pfpebwt is the fastest method on all collections, with a maximum speedup of 7.6x over the second-best method. Its peak memory is at most 2x that of the second-best method. Compared with methods that, like our algorithm, can also report suffix array samples, we obtain a 57.1x improvement in peak memory.
The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
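
As a rough illustration of the definition only, the sketch below computes an eBWT naively in Python by sorting all rotations of all input strings and taking their last characters. It is not the linear-time algorithm of the paper and not pfpebwt. The function name `ebwt` is hypothetical, and non-empty input strings are assumed. Rotations are compared by their infinite repetitions (omega-order); comparing prefixes of length |u| + |v| is enough to decide any such comparison, so a common prefix length of twice the longest string is used.

```python
def ebwt(strings):
    """Naive eBWT sketch: last characters of all rotations in omega-order.

    Assumes every string in `strings` is non-empty. Runs in far more than
    linear time; intended only to illustrate the definition.
    """
    max_len = max(len(s) for s in strings)
    # For rotations u and v, the first |u| + |v| characters of their infinite
    # repetitions decide the comparison, and |u| + |v| <= 2 * max_len.
    width = 2 * max_len
    rotations = []
    for s in strings:
        for i in range(len(s)):
            rot = s[i:] + s[:i]
            # Prefix of rot repeated forever, truncated to `width` characters.
            key = (rot * (width // len(rot) + 1))[:width]
            rotations.append((key, rot[-1]))
    rotations.sort()
    return "".join(last for _, last in rotations)


print(ebwt(["banana"]))        # single string: BWT without an end-of-string symbol
print(ebwt(["ab", "b"]))       # collection of two strings
print(ebwt(["b", "ab"]))       # same collection, different input order
```

The last two calls give the same output, which illustrates the order-independence property the abstract highlights. No end-of-string symbol is appended to any input.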

Articles published on Suffix Array

BWA-MEME: BWA-MEM emulated with a machine learning approach.

A study for extracting keywords from data with deep learning and suffix array

An evaluation of fast segmented sorting implementations on GPUs

SLDMS: A Tool for Calculating the Overlapping Regions of Sequences.

SaAlign: Multiple DNA/RNA sequence alignment and phylogenetic tree construction tool for ultra-large datasets and ultra-long sequences based on suffix array

Dfinder—An efficient differencing algorithm for incremental programming of constrained IoT devices

An optimized FM-index library for nucleotide and amino acid search

Optimal in-place suffix sorting

Full-text search engine with suffix index for massive heterogeneous data

The exact multiple pattern matching problem solved by a reference tree approach

Computation of the suffix array, Burrows-Wheeler transform and FM-index in V-order

Towards Dynamic Verifiable Pattern Matching

Faster repetition-aware compressed suffix trees based on Block Trees

Suffix array for multi-pattern matching with variable length wildcards

Building and Checking Suffix Array Simultaneously by Induced Sorting Method

Designing efficient algorithms for querying large corpora

Towards a Complete Perspective on Labeled Tree Indexing: New Size Bounds, Efficient Constructions, and Beyond

Computing the original eBWT faster, simpler, and with less memory.

R-indexing the eBWT.

Computing Maximal Lyndon Substrings of a String