Label-guided seed-chain-extend alignment on annotated De Bruijn graphs.

Harun Mustafa, André Kahles, Gunnar Rätsch, Mikhail Karasikov, Nika Mansouri Ghiasi

doi:10.1093/bioinformatics/btae226

Abstract

Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, which complicates finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs produce shorter alignments because of this fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations. To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model, adapting recent chaining improvements to assembly graphs, and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, and finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
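
To make the idea of a 'Label Change' operation more concrete, the following is a minimal Python sketch of chaining seed anchors across sample labels. It is not the authors' implementation or scoring model: the `Anchor` type, the `chain_with_label_change` function, the `switch_cost` parameter, and the penalty form `switch_cost * (1 - similarity)` are all hypothetical choices made for illustration. The sketch also chains on query coordinates only and ignores graph distances and the 'Node Length Change' operation, both of which a real aligner on an annotated DBG would have to handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """A seed match (hypothetical type, for illustration only)."""
    query_pos: int  # start of the seed match in the query
    length: int     # seed length in characters
    label: str      # sample ID annotating the matched graph nodes


def chain_with_label_change(anchors, similarity, switch_cost=4.0):
    """Toy dynamic-programming chainer over query coordinates.

    Assumption: switching from label a to label b costs
    switch_cost * (1 - similarity[{a, b}]), so similar samples are
    cheap to combine and dissimilar ones are expensive.
    Returns (best_score, chain) where chain is a list of anchors.
    """
    anchors = sorted(anchors, key=lambda a: (a.query_pos, a.label))
    if not anchors:
        return 0.0, []

    best = [float(a.length) for a in anchors]  # best chain score ending at each anchor
    prev = [None] * len(anchors)               # back-pointers for traceback

    for j, b in enumerate(anchors):
        for i in range(j):
            a = anchors[i]
            if a.query_pos + a.length > b.query_pos:
                continue  # overlapping seeds are not chained in this sketch
            penalty = 0.0
            if a.label != b.label:
                sim = similarity.get(frozenset((a.label, b.label)), 0.0)
                penalty = switch_cost * (1.0 - sim)
            candidate = best[i] + b.length - penalty
            if candidate > best[j]:
                best[j], prev[j] = candidate, i

    # Trace back from the highest-scoring chain end.
    k = max(range(len(anchors)), key=best.__getitem__)
    score = best[k]
    chain = []
    while k is not None:
        chain.append(anchors[k])
        k = prev[k]
    chain.reverse()
    return score, chain


if __name__ == "__main__":
    # Sample s1 is fragmented; s2 is similar to s1, s3 is not.
    seeds = [
        Anchor(0, 10, "s1"),
        Anchor(12, 10, "s2"),
        Anchor(12, 10, "s3"),
        Anchor(25, 10, "s1"),
    ]
    sims = {frozenset(("s1", "s2")): 0.9, frozenset(("s1", "s3")): 0.1}
    _, chain = chain_with_label_change(seeds, sims)
    print([a.label for a in chain])  # ['s1', 's2', 's1']
```

In this toy example the gap in sample `s1` is bridged through the similar sample `s2` rather than the dissimilar `s3`, which mirrors the stated goal of favouring biologically relevant sample combinations over arbitrary ones.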

Journal: Bioinformatics (Oxford, England)
Publication Date: Jun 28, 2024
License type: cc-by
