Burrows-Wheeler Research Articles

Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built on a well-defined k-mer-based object such as Jaccard Similarity, de Bruijn graphs, k-mer spectra, or Bray-Curtis Dissimilarity. Although these objects offer a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive. This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs over a range of k-mer sizes can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set. Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties in exploring varying k-mer sizes due to their limitations in grouping co-occurring substrings. We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics.
The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.
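To make the notion of a k-mer-based object varying with k concrete, the following is a minimal Python sketch of one such object, Jaccard Similarity, recomputed naively for each k-mer size. It is not the Prokrustean graph and does not use the linked repository; the function names `kmer_set`, `jaccard`, and `jaccard_profile` are hypothetical and chosen only for illustration. Its cost grows with the number of k values examined, which is the dependence the abstract says the framework removes.

```python
def kmer_set(seq, k):
    """Return the set of distinct length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity of the k-mer sets of two sequences."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    # If neither sequence has any k-mer of this size, report 0.0 by convention.
    return len(ka & kb) / len(union) if union else 0.0


def jaccard_profile(a, b, k_values):
    """Naive baseline: recompute the similarity from scratch for every k."""
    return {k: jaccard(a, b, k) for k in k_values}


if __name__ == "__main__":
    # Toy sequences, not data from the paper.
    for k, value in jaccard_profile("ACGTACGT", "ACGTTCGT", range(1, 9)).items():
        print(k, round(value, 3))
```

Even on this toy input the similarity changes sharply from one k to the next, which illustrates why the choice of a single k-mer size can be hard to justify.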

Background: Molecular phylogenetics studies the evolutionary relationships among the individuals of a population through their biological sequences. It may provide insights about the origin and the evolution of viral diseases, or highlight complex evolutionary trajectories. A key task is inferring phylogenetic trees from any type of sequencing data, including raw short reads. Yet, several tools require pre-processed input data, e.g., from complex computational pipelines based on de novo assembly or from mappings against a reference genome. As sequencing technologies keep becoming cheaper, there is increasing pressure to design methods that perform analysis directly on their outputs. From this viewpoint, there is a growing interest in alignment-, assembly-, and reference-free methods that can work on several types of data, including raw reads.

Results: We present phyBWT2, an improved version of phyBWT (Guerrini et al. in 22nd International Workshop on Algorithms in Bioinformatics (WABI) 242:23:1–23:19, 2022). Both tools reconstruct phylogenetic trees directly, bypassing both the alignment against a reference genome and de novo assembly. They exploit the combinatorial properties of the extended Burrows-Wheeler Transform (eBWT) and the corresponding eBWT positional clustering framework to detect relevant blocks of the longest shared substrings of varying length (unlike the k-mer-based approaches that need to fix the length k a priori). As a result, they provide novel alignment-, assembly-, and reference-free methods that build partition trees without relying on the pairwise comparison of sequences, thus avoiding the use of a distance matrix to infer phylogeny. In addition, phyBWT2 outperforms phyBWT in terms of running time, as the former reconstructs phylogenetic trees step-by-step by considering multiple partitions at a time, instead of just one partition at a time as the latter does.

Conclusions: Based on the results of the experiments on sequencing data, we conclude that our method can produce trees of quality comparable to the benchmark phylogeny while handling datasets of different types (short reads, contigs, or entire genomes). Overall, the experiments confirm the effectiveness of phyBWT2, which improves on the performance of its previous version phyBWT while preserving the accuracy of the results.
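As a rough illustration of the transform these tools build on, here is a minimal Python sketch of the extended Burrows-Wheeler Transform of a small string collection, following the classical definition: all cyclic rotations of all strings are sorted by comparing their infinite repetitions, and the last character of each rotation is emitted. This is a quadratic toy version, not the phyBWT2 implementation (which may, for instance, rely on end-markers and compressed indices), and the positional clustering step itself is not shown. The names `omega_cmp` and `ebwt` are hypothetical.

```python
from functools import cmp_to_key


def omega_cmp(u, v):
    """Compare the infinite repetitions u^inf and v^inf.

    By the Fine-Wilf periodicity lemma, inspecting the first
    len(u) + len(v) characters is enough to decide the order.
    """
    for i in range(len(u) + len(v)):
        cu, cv = u[i % len(u)], v[i % len(v)]
        if cu != cv:
            return -1 if cu < cv else 1
    return 0


def ebwt(strings):
    """Return (transform, doc_ids) for a collection of strings.

    doc_ids[i] is the index of the input string that contributed
    the i-th character of the transform.
    """
    rotations = []
    for doc, s in enumerate(strings):
        for i in range(len(s)):
            rotations.append((s[i:] + s[:i], doc))
    # Stable sort: rotations that compare equal keep their input order.
    rotations.sort(key=cmp_to_key(lambda x, y: omega_cmp(x[0], y[0])))
    transform = "".join(rot[-1] for rot, _ in rotations)
    doc_ids = [doc for _, doc in rotations]
    return transform, doc_ids


if __name__ == "__main__":
    # Toy collection, not data from the paper.
    text, docs = ebwt(["ACGT", "ACGA", "TCGA"])
    print(text)
    print(docs)
```

Characters that precede the same context end up adjacent in the transform, and the parallel `doc_ids` list records which input sequence each one came from. Runs of adjacent positions coming from different sequences hint at shared substrings, which is the kind of signal a clustering over eBWT positions can exploit.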


Articles published on Burrows-Wheeler

BWT construction and search at the terabase scale.

HAlign 4: A New Strategy for Rapidly Aligning Millions of Sequences.

Integration of BWT scrambling and data compression in an innovative system enhances protection and versatile management of sensor feeds (SEC)

Prokrustean Graph: A substring index for rapid k-mer size analysis.

A survey of BWT variants for string collections.

A Compression and Encryption Based Heart Disease Diagnosis with Deep Learning through ECG Signals

Design and Analysis of Reconfigurable Origami-Based Vacuum Pneumatic Artificial Muscles for Versatile Robotic System.

Constructing and indexing the bijective and extended Burrows–Wheeler transform

Reversible Data Hiding of JPEG Image Based on Adaptive Frequency Band Length

Reversible data hiding of JPEG images based on block sorting and segmented embedding

Concurrent encryption and lossless compression using inversion ranks

μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data.

PhyBWT2: phylogeny reconstruction via eBWT positional clustering

A new class of string transformations for compressed text indexing

Reversible data hiding for JPEG images based on block difference model and Laplacian distribution estimation

An efficient and secure compression technique for data protection using burrows-wheeler transform algorithm

Adaptive multi-predictor based reversible data hiding with superpixel irregular block sorting and optimization

Two-Cloud Private Read Alignment to a Public Reference Genome

Weighted Burrows–Wheeler Compression

Novel Block Sorting and Symbol Prediction Algorithm for PDE-Based Lossless Image Compression: A Comparative Study with JPEG and JPEG 2000
