Shared Memory Machines Research Articles

Fast, lightweight methods for comparing the sequences of ever-larger assembled genomes from ever-growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences. We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state of the art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. We also provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics. Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0.
The data underlying this article are available in NCBI Genome at https://www.ncbi.nlm.nih.gov/genome and in the International Genome Sample Resource (IGSR) at https://www.internationalgenome.org. Supplementary data are available at Bioinformatics online.
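
The abstract above uses matching statistics without defining it. Under the standard definition, the matching statistics array of a query string S against a text T stores, for each position i, the length of the longest prefix of S[i..] that occurs somewhere in T. The sketch below is a deliberately naive illustration of that definition and of range-maximum and range-sum queries over the resulting array. It is not the parallel algorithm or the compact representation described in the abstract, and the function names are hypothetical.

```python
def matching_statistics(s, t):
    """Naive matching statistics of query s against text t.

    ms[i] is the length of the longest prefix of s[i:] that occurs in t.
    Illustrative only: real tools use suffix-tree or BWT-based indexes.
    """
    ms = []
    for i in range(len(s)):
        length = 0
        # Extend the match one character at a time while it still occurs in t.
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms


def range_max(ms, i, j):
    """Maximum matching-statistics value over the half-open window [i, j)."""
    return max(ms[i:j])


def range_sum(ms, i, j):
    """Sum of matching-statistics values over the half-open window [i, j)."""
    return sum(ms[i:j])


ms = matching_statistics("abcab", "cabd")
print(ms)
print(range_max(ms, 1, 4), range_sum(ms, 1, 4))
```

Windows with a high range maximum or range sum point at regions of the query that share long substrings with the text, which is the kind of local similarity statistic the abstract refers to.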

Motivation: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem.

Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using 29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using 84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using 126 GB of memory, and over 16 h using 289 GB of memory, respectively.

Availability and implementation: Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish.

Supplementary information: Supplementary data are available at Bioinformatics online.
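
To make the term "compacted de Bruijn graph" in the abstract above concrete, the following is a minimal sketch of compaction: every k-mer is a vertex, two k-mers are joined when they overlap by k-1 characters, and each maximal non-branching path is collapsed into a single string (a unitig). This is not Cuttlefish's automaton-based algorithm. As simplifying assumptions, it handles the forward strand only, ignores reverse complements and colors, and keeps all k-mers in memory. The function name `compact_de_bruijn` is hypothetical.

```python
def compact_de_bruijn(seqs, k):
    """Return the sorted unitigs of a forward-strand-only de Bruijn graph."""
    kmers = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}

    def successors(u):
        return [u[1:] + c for c in "ACGT" if u[1:] + c in kmers]

    def predecessors(u):
        return [c + u[:-1] for c in "ACGT" if c + u[:-1] in kmers]

    visited = set()
    unitigs = []
    for seed in sorted(kmers):
        if seed in visited:
            continue
        # Walk backwards to the first k-mer of the non-branching path.
        start = seed
        while True:
            preds = predecessors(start)
            if (len(preds) != 1 or len(successors(preds[0])) != 1
                    or preds[0] == seed):  # the last test stops on pure cycles
                break
            start = preds[0]
        # Walk forwards, appending one character per merged k-mer.
        path = start
        cur = start
        visited.add(cur)
        while True:
            succs = successors(cur)
            if (len(succs) != 1 or len(predecessors(succs[0])) != 1
                    or succs[0] in visited):
                break
            cur = succs[0]
            visited.add(cur)
            path += cur[-1]
        unitigs.append(path)
    return sorted(unitigs)


print(compact_de_bruijn(["ACGTT", "ACGTA"], 3))
```

In this toy input the two sequences share the prefix ACGT and then diverge, so the shared part becomes one unitig and each divergent k-mer becomes its own.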

Articles published on Shared Memory Machines

Parallelization of particle-mass-transfer algorithms on shared-memory, multi-core CPUs

CXL and the Return of Scale-Up Database Engines

A general approach for supporting nonblocking data structures on distributed-memory systems

Parallel Weighted Random Sampling

Computer big data modeling system based on finite element mathematical equation simulation

Fast and compact matching statistics analytics.

An Algorithm for the Sequence Alignment with Gap Penalty Problem using Multiway Divide-and-Conquer and Matrix Transposition

Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections.

PBBFMM3D: A parallel black-box algorithm for kernel matrix-vector multiplication

RCHOL: Randomized Cholesky Factorization for Solving SDD Linear Systems

Parallelization of the inverse fast multipole method with an application to boundary element method

Faster and Better Nested Dissection Orders for Customizable Contraction Hierarchies

Computations of permeability of large rock images by dual grid domain decomposition

Parallel computation of Watershed Transform in weighted graphs on shared memory machines

Slurm: Fluid particle-in-cell code for plasma modeling

A two-scale generalized finite element method for parallel simulations of spot welds in large structures

SCALO

Mplrs: A scalable parallel vertex/facet enumeration code

Performance and energy metrics for multi-threaded applications on DVFS processors

Scalable training of 3D convolutional networks on multi- and many-cores
