Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads.

Chengxi Ye, Zhanshan (Sam) Ma

doi:10.7717/peerj.2016

Abstract

Motivation. Third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated to be in the range of 15–40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundamental bioinformatics tasks such as de novo genome assembly and variant calling require high-quality sequences, which need to be extracted from these long but erroneous 3GS sequences.

Results. We describe Sparc, a versatile and efficient linear complexity consensus algorithm that facilitates de novo genome assembly. Sparc builds a sparse k-mer graph using a collection of sequences from a targeted genomic region. It then searches a sparsity-induced reweighted graph for the heaviest path, which approximates the most likely genome sequence, and outputs that path as the consensus sequence. Sparc supports using NGS and 3GS data together, which leads to significant improvements in both cost efficiency and computational efficiency. Experiments with Sparc show that our algorithm can efficiently provide high-quality consensus sequences using both PacBio and Oxford Nanopore sequencing technologies. With only 30× PacBio data, Sparc can reach a consensus with an error rate <0.5%. With the more challenging Oxford Nanopore data, Sparc can also achieve a similar error rate when combined with NGS data. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy, and uses approximately 80% less memory and time.

Availability. The source code is available for download at https://github.com/yechengxi/Sparc.

Highlights

Three generations of DNA sequencing technologies have been developed in the last three decades, and we are at the crossroads of the second and third generations of sequencing technologies.
Sparc consists of the following four simple steps: (i) Build an initial position specific k-mer graph (Ye et al, 2012) using the draft assembly/backbone sequence. (ii) Align sequences to the backbone to modify the existing graph. (iii) Adjust the edge weights with a sparse penalty. (iv) Search for a heaviest path and output the consensus sequence.
While there are platform-specific programs that take into account signal processing-level information, such as Quiver and Nanopolish, these programs usually take the outputs of base-level consensus algorithms as inputs to further improve the accuracy.
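
The four steps listed in the highlights can be illustrated with a toy sketch. This is not the Sparc implementation. It assumes that reads are already aligned to the backbone without gaps and are given as hypothetical (offset, sequence) pairs, that every position is sampled (no sparse k-mer skipping), and that the sparse penalty is a constant subtracted from each edge weight. The names build_graph, sparc_like_consensus, k and penalty are illustrative only.

```python
from collections import defaultdict


def build_graph(backbone, aligned_reads, k=2):
    """Steps (i)-(ii): position-specific k-mer graph.

    Nodes are (position, k-mer) pairs; an edge weight counts how many
    sequences (backbone or read) support that transition.
    Assumption: reads are gap-free and given as (offset, sequence).
    """
    weights = defaultdict(int)
    for offset, seq in [(0, backbone)] + list(aligned_reads):
        for i in range(len(seq) - k):
            u = (offset + i, seq[i:i + k])
            v = (offset + i + 1, seq[i + 1:i + 1 + k])
            weights[(u, v)] += 1
    return weights


def sparc_like_consensus(backbone, aligned_reads, k=2, penalty=1.0):
    weights = build_graph(backbone, aligned_reads, k)

    # Step (iii): assumed form of the sparse penalty, a constant
    # subtracted from every edge so weakly supported edges gain nothing.
    adjusted = {edge: w - penalty for edge, w in weights.items()}

    # Step (iv): heaviest path by dynamic programming. Every edge goes
    # from position p to p + 1, so visiting edges in order of source
    # position guarantees a node's score is final before it is used.
    best = defaultdict(lambda: (0.0, None))  # node -> (score, predecessor)
    for (u, v), w in sorted(adjusted.items(), key=lambda item: item[0][0][0]):
        score = best[u][0] + w
        if score > best[v][0]:
            best[v] = (score, u)

    # Trace back from the best-scoring node to recover the path.
    node = max(best, key=lambda n: best[n][0])
    path = []
    while node is not None:
        path.append(node)
        node = best[node][1]
    path.reverse()

    # Spell the consensus: first k-mer, then the last base of each next k-mer.
    return path[0][1] + "".join(kmer[-1] for _, kmer in path[1:])


backbone = "ACGATGCA"  # draft with one wrong base at position 3
reads = [(0, "ACGTTGCA"), (0, "ACGTTGCA"), (2, "GTTGCA")]
consensus = sparc_like_consensus(backbone, reads)
print(consensus)  # ACGTTGCA
```

In this toy input the backbone's erroneous branch is supported only by the backbone itself, so after the penalty its edges contribute nothing and the path supported by the reads wins. A real implementation would also have to handle insertions and deletions from the alignment step.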

Summary

Introduction

Three generations of DNA sequencing technologies have been developed in the last three decades, and we are at the crossroads of the second and third generations of sequencing technologies. With 3GS data, de novo genome assembly algorithms need to pass through three major bottlenecks: finding overlaps (Berlin et al, 2015; Ye et al, 2014), sequence alignment (Chaisson & Tesler, 2012; Myers, 2014) and sequence polishing/error correction. Correcting these long erroneous reads is a non-trivial problem (Au et al, 2012; Hackl et al, 2014; Koren et al, 2012; Salmela & Rivals, 2014).

Methods

Results

Conclusion