Abstract

Third-generation sequencing technologies produce long reads of tens to hundreds of kbp, which benefits genomic research. However, the high error rate of long reads seriously limits downstream analysis. Downstream analysis can be made more effective only by reducing this error rate while preserving the length advantage. Here we propose LocPatcH, an accurate, efficient, and universal hybrid error correction algorithm based on local machine learning. For each region of a long read that is aligned with abundant accurate short reads from second-generation sequencing, LocPatcH constructs a profile hidden Markov model, trains it on the alignment information of those short reads, and uses the trained model to correct the region. The remaining aligned regions, which have lower coverage depth, are corrected with an approach referred to as “patching”. On real PacBio and Nanopore sequencing datasets, the proposed method outperforms mainstream hybrid error correction methods in continuity and memory usage.
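
As a rough illustration of training a per-region model from short-read alignments, the sketch below estimates only the match-state emission probabilities of a profile, column by column, and then reads off the most probable base. It is a deliberately simplified stand-in and not LocPatcH's implementation: it has no insert states, no transition probabilities, and no Viterbi decoding. The function names, the gapped-string input format ('-' marking a base in the long read that the short read lacks), and the pseudocount value are all assumptions made for this example.

```python
from collections import Counter

ALPHABET = "ACGT-"  # '-' marks a long-read base that the short read lacks


def train_emissions(aligned_short_reads, pseudocount=1.0):
    """Estimate per-column emission probabilities from gapped short reads.

    Assumption: every short read is already aligned to the same long-read
    region and padded to the same width.
    """
    if not aligned_short_reads:
        raise ValueError("need at least one aligned short read")
    width = len(aligned_short_reads[0])
    if any(len(read) != width for read in aligned_short_reads):
        raise ValueError("all aligned reads must have the same length")

    profile = []
    for col in range(width):
        counts = Counter(read[col] for read in aligned_short_reads)
        # Symbols outside ALPHABET (for example 'N') are ignored.
        total = sum(counts[s] for s in ALPHABET) + pseudocount * len(ALPHABET)
        profile.append({s: (counts[s] + pseudocount) / total for s in ALPHABET})
    return profile


def correct_region(profile):
    """Take the most probable symbol per column and drop deleted columns."""
    best = (max(column, key=column.get) for column in profile)
    return "".join(symbol for symbol in best if symbol != "-")


# Hypothetical region: the long read carries one spurious base (column 5),
# and one of the three short reads has a mismatch at column 3.
reads = ["ACGTT-AGC", "ACGTT-AGC", "ACGAT-AGC"]
profile = train_emissions(reads)
print(correct_region(profile))  # prints ACGTTAGC
```

The pseudocount keeps every symbol's probability nonzero, so a column covered by only a few short reads is not forced to a hard zero for unseen bases. A column-wise vote like this is only reasonable where coverage is high, which matches the abstract's distinction between well-covered regions and the lower-coverage regions handled by patching.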
