FLAS: fast and high-throughput algorithm for PacBio long-read self-correction.

Ergude Bao,Fei Xie,Dandan Song,Changjin Song,Bonnie Berger

doi:10.1093/bioinformatics/btz206

Abstract

The third generation PacBio long reads have greatly facilitated sequencing projects with very large read lengths, but they contain about 15% sequencing errors and need error correction. For the projects with long reads only, it is challenging to make correction with fast speed, and also challenging to correct a sufficient amount of read bases, i.e. to achieve high-throughput self-correction. MECAT is currently among the fastest self-correction algorithms, but its throughput is relatively small (Xiao et al., 2017). Here, we introduce FLAS, a wrapper algorithm of MECAT, to achieve high-throughput long-read self-correction while keeping MECAT's fast speed. FLAS finds additional alignments from MECAT prealigned long reads to improve the correction throughput, and removes misalignments for accuracy. In addition, FLAS also uses the corrected long-read regions to correct the uncorrected ones to further improve the throughput. In our performance tests on Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana and human long reads, FLAS can achieve 22.0-50.6% larger throughput than MECAT. FLAS is 2-13× faster compared to the self-correction algorithms other than MECAT, and its throughput is also 9.8-281.8% larger. The FLAS corrected long reads can be assembled into contigs of 13.1-29.8% larger N50 sizes than MECAT. The FLAS software can be downloaded for free from this site: https://github.com/baoe/flas. Supplementary data are available at Bioinformatics online.

Full Text