Longest Common Prefix Array Research Articles

Background: For decades, researchers have worked to speed up genome sequencing and reduce its cost, and they have made great strides in both areas, making genome data easier to study and analyze. However, efficiently storing and transmitting the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers. As a result, genome compression algorithms that represent this data efficiently have attracted growing attention. Because current computing devices have multiple cores, making full use of them through efficient parallel processing is also an important direction in the design of genome compression algorithms.

Results: We propose a reference-based algorithm (LMSRGC) that uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMSs) for compressing genome data in FASTA format. The algorithm uses the properties of the SA and the LCP array to select all appropriate LMSs between the target genome sequence and the reference genome sequence, and then uses these LMSs to compress the target sequence. To speed up the algorithm, we use GPUs to parallelize the construction of the SA, and multiple threads to parallelize the creation of the LCP array and the filtering of LMSs.

Conclusions: Experimental results demonstrate that our algorithm is competitive with current state-of-the-art algorithms in compression ratio and compression time.
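The general idea of using an SA and an LCP array to find matches against a reference can be sketched in a few lines. The sketch below is not LMSRGC itself: it is a minimal, single-threaded illustration that assumes a naive suffix sort, Kasai-style LCP construction, and a simple greedy parse. The function names, the `#` separator, and the `min_len` threshold are all hypothetical choices made for the example.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; fine for a small illustration only.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Kasai-style construction: lcp[i] is the length of the longest common
    # prefix of the suffixes starting at sa[i-1] and sa[i]; lcp[0] is 0.
    n = len(s)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0
    for p in range(n):
        if rank[p] > 0:
            q = sa[rank[p] - 1]
            while p + h < n and q + h < n and s[p + h] == s[q + h]:
                h += 1
            lcp[rank[p]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def match_stats(reference, target, sep="#"):
    # For each target position j, return (ref_pos, length) of a longest
    # substring of the reference that matches target[j:].
    # Assumption: sep occurs in neither sequence.
    assert sep not in reference and sep not in target
    text = reference + sep + target
    n, n_ref = len(text), len(reference)
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    best = [(-1, 0)] * len(target)

    # Forward sweep: nearest reference suffix that precedes each target suffix.
    cur_pos, cur_len = -1, 0
    for i in range(n):
        cur_len = min(cur_len, lcp[i])
        p = sa[i]
        if p < n_ref:
            cur_pos, cur_len = p, n
        elif p > n_ref and cur_len > best[p - n_ref - 1][1]:
            best[p - n_ref - 1] = (cur_pos, cur_len)

    # Backward sweep: nearest reference suffix that follows each target suffix.
    cur_pos, cur_len = -1, 0
    for i in range(n - 1, -1, -1):
        p = sa[i]
        if p < n_ref:
            cur_pos, cur_len = p, n
        elif p > n_ref and cur_len > best[p - n_ref - 1][1]:
            best[p - n_ref - 1] = (cur_pos, cur_len)
        cur_len = min(cur_len, lcp[i])
    return best


def greedy_parse(reference, target, min_len=2):
    # Encode the target as (ref_pos, length) copies plus literal characters.
    stats = match_stats(reference, target)
    out, j = [], 0
    while j < len(target):
        pos, length = stats[j]
        if length >= min_len:
            out.append((pos, length))
            j += length
        else:
            out.append(target[j])
            j += 1
    return out


def decode(reference, parse):
    # Rebuild the target from the reference and the parse.
    return "".join(
        reference[x[0]:x[0] + x[1]] if isinstance(x, tuple) else x for x in parse
    )
```

A round trip such as `decode(ref, greedy_parse(ref, tgt)) == tgt` should hold for any pair of sequences that do not contain the separator. A real compressor would replace the naive sort with a linear-time or GPU-based SA construction and would then entropy-code the resulting pairs.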


Background: Sequencing technologies produce ever larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory and in a time-efficient way, making the best possible use of the available RAM.

Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections small enough that their BWTs can be computed in RAM using an optimal linear-time algorithm. It then merges the partial BWTs in external or semi-external memory and computes the LCP values in the process. The algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well-known problems in bioinformatics: the computation of maximal repeats, the all-pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs.

Conclusions: We prove that our algorithm performs $\mathcal{O}(n\,\mathsf{maxlcp})$ sequential I/Os, where $n$ is the total length of the collection and $\mathsf{maxlcp}$ is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences, but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
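To make the objects in this abstract concrete, the following sketch computes the BWT and LCP array of a small collection directly in memory. It is only a naive reference definition, not the external-memory merge algorithm described above. It assumes each sequence is terminated by its own end marker, that markers sort before all other symbols and never match each other, and that every marker is printed as `$`. The function name `multi_bwt_lcp` is hypothetical.

```python
def multi_bwt_lcp(seqs):
    # Key for the suffix of seqs[i] starting at j: ordinary symbols are
    # tagged (1, c); the end marker of sequence i is (0, i), so markers sort
    # first and markers of different sequences are distinct.
    def key(i, j):
        return [(1, c) for c in seqs[i][j:]] + [(0, i)]

    # All suffixes of all sequences, including the marker-only suffix.
    suffixes = sorted(
        ((i, j) for i, s in enumerate(seqs) for j in range(len(s) + 1)),
        key=lambda t: key(*t),
    )

    # BWT symbol: the character preceding each suffix in its own sequence;
    # a suffix starting at position 0 is preceded (cyclically) by the marker.
    bwt = "".join(seqs[i][j - 1] if j > 0 else "$" for i, j in suffixes)

    # LCP between lexicographically consecutive suffixes; lcp[0] is 0.
    lcp = [0]
    for a, b in zip(suffixes, suffixes[1:]):
        ka, kb = key(*a), key(*b)
        h = 0
        while h < min(len(ka), len(kb)) and ka[h] == kb[h]:
            h += 1
        lcp.append(h)
    return bwt, lcp


bwt, lcp = multi_bwt_lcp(["AB", "AB"])
maxlcp = max(lcp)  # the quantity called maxlcp in the I/O bound
```

For the two-sequence collection above, the sorted suffixes are `$`, `$`, `AB$`, `AB$`, `B$`, `B$`, which gives the BWT `BB$$AA` and the LCP array `[0, 0, 0, 2, 0, 1]`, so `maxlcp` is 2.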


Articles published on Longest Common Prefix Array

Reference-based genome compression using the longest matched substrings with parallelization consideration

Computing matching statistics on Wheeler DFAs.

Space-time Trade-offs for the LCP Array of Wheeler DFAs.

String inference from longest-common-prefix array

Computing the multi-string BWT and LCP array in external memory

Gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections

An efficient algorithm for identifying (ℓ, d) motif from huge DNA datasets

Multithread Multistring Burrows-Wheeler Transform and Longest Common Prefix Array.

External memory BWT and LCP computation for sequence collections with applications

Better External Memory LCP Array Construction

Extended suffix array construction using Lyndon factors

Checking Big Suffix and LCP Arrays by Probabilistic Methods

Burrows–Wheeler transform and LCP array construction in constant space

LCP Array Construction in External Memory

Tighter bounds for the sum of irreducible LCP values

Computing the Longest Previous Factor

Parameterized longest previous factor

Dynamic extended suffix arrays
