Abstract

Background: Graph-based reference genomes have become popular as they allow read mapping and follow-up analyses in settings where the exact haplotypes underlying a high-throughput sequencing experiment are not precisely known. Two recent papers show that mapping to graph-based reference genomes can improve accuracy as compared to methods using linear references. Both of these methods index the sequences for most paths up to a certain length in the graph in order to enable direct mapping of reads containing common variants. However, the combinatorial explosion of possible paths through nearby variants also leads to a huge search space and an increased chance of false positive alignments to highly variable regions.

Results: We here assess three prominent graph-based read mappers against a hybrid baseline approach that combines an initial path determination with a tuned linear read mapping method. We show, using a previously proposed benchmark, that this simple approach is able to improve overall accuracy of read-mapping to graph-based reference genomes.

Conclusions: Our method is implemented in a tool Two-step Graph Mapper, which is available at https://github.com/uio-bmi/two_step_graph_mapper along with data and scripts for reproducing the experiments. Our method highlights characteristics of the current generation of graph-based read mappers and shows potential for improvement for future graph-based read mappers.

Highlights

  • Graph-based reference genomes have become popular as they allow read mapping and follow-up analyses in settings where the exact haplotypes underlying a high-throughput sequencing experiment are not precisely known

  • Mapping accuracies are compared using receiver operating characteristic (ROC) curves parameterized by the mapping quality (MAPQ) of all the simulated reads, where each dot in the plot shows the recall and error rate for reads with at least the corresponding MAPQ

  • We suggest that the path prediction itself can be achieved by initial rough graph mapping; as an example, we use a method in which all reads are first aligned to the linear reference genome and subsequently fitted locally to the graph
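The MAPQ-parameterized ROC curve mentioned in the highlights can be sketched as follows. This is a minimal illustration, not the authors' benchmark code: the function name `roc_points` and the input representation (one `(mapq, is_correct)` pair per simulated read) are assumptions. The definitions used here, recall as correct reads at or above the threshold divided by all reads, and error rate as wrong reads at or above the threshold divided by the reads at or above the threshold, are one common convention and may differ from the benchmark's exact definitions.

```python
from typing import Iterable, List, Tuple


def roc_points(reads: Iterable[Tuple[int, bool]]) -> List[Tuple[int, float, float]]:
    """Return (mapq_threshold, recall, error_rate) for each distinct MAPQ.

    `reads` holds one (mapq, is_correct) pair per simulated read.
    Assumed definitions (hypothetical, for illustration only):
      recall     = correct reads with MAPQ >= t / all reads
      error rate = wrong reads with MAPQ >= t / reads with MAPQ >= t
    """
    reads = list(reads)
    total = len(reads)
    if total == 0:
        return []
    points = []
    # One point ("dot") per distinct MAPQ value, from strictest to most lenient.
    for threshold in sorted({mapq for mapq, _ in reads}, reverse=True):
        kept = [ok for mapq, ok in reads if mapq >= threshold]
        correct = sum(kept)
        wrong = len(kept) - correct
        points.append((threshold, correct / total, wrong / len(kept)))
    return points


if __name__ == "__main__":
    toy = [(60, True), (60, True), (30, True), (30, False), (0, False)]
    for threshold, recall, error_rate in roc_points(toy):
        print(threshold, recall, error_rate)
```

Lowering the threshold can only keep or increase recall, while the error rate typically rises as lower-confidence reads are admitted, which is what gives the curve its shape.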


Summary

Introduction

Grytten et al. BMC Genomics (2020) 21:282

Graph-based reference genomes have become popular as they allow read mapping and follow-up analyses in settings where the exact haplotypes underlying a high-throughput sequencing experiment are not precisely known. Two recent papers show that mapping to graph-based reference genomes can improve accuracy as compared to methods using linear references. Both of these methods index the sequences for most paths up to a certain length in the graph in order to enable direct mapping of reads containing common variants. [...] more than a day for a human whole-genome graph. Seven Bridges uses a faster approach in which only short k-mers (21 base pair sequences at 7 base pair intervals) are indexed. This enables indexing of a human whole-genome graph in only minutes. As complex graphs containing many genetic variants can result in long indexing time as well as poor mapping accuracy [3], existing graph-based read mappers ignore the most complex regions in the graph when indexing it. Some have proposed to not use graphs, but instead improve the current linear reference genome [13].
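To make the sampling scheme mentioned above concrete (21 base pair k-mers taken at 7 base pair intervals), the sketch below applies it to a plain linear sequence. It only illustrates the sampling arithmetic: it is not the Seven Bridges implementation, it ignores graph paths and variants entirely, and the function name `sample_kmers` is hypothetical.

```python
from typing import Dict, List


def sample_kmers(sequence: str, k: int = 21, step: int = 7) -> Dict[str, List[int]]:
    """Map each sampled k-mer to the 0-based offsets where it starts.

    Only every `step`-th start position is indexed, so the index holds
    roughly len(sequence) / step entries instead of one per position.
    """
    index: Dict[str, List[int]] = {}
    for start in range(0, len(sequence) - k + 1, step):
        index.setdefault(sequence[start:start + k], []).append(start)
    return index


if __name__ == "__main__":
    toy = "ACGT" * 10  # 40 bp toy sequence
    for kmer, offsets in sample_kmers(toy).items():
        print(kmer, offsets)
```

Sampling at intervals rather than at every position trades some sensitivity for a much smaller index, which is consistent with the short indexing time reported above.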

