Abstract

A major challenge to long read sequencing data is their high error rate of up to 15%. We present Ratatosk, a method to correct long reads with short read data. We demonstrate on 5 human genome trios that Ratatosk reduces the error rate of long reads 6-fold on average with a median error rate as low as 0.22 %. SNP calls in Ratatosk corrected reads are nearly 99 % accurate and indel calls accuracy is increased by up to 37 %. An assembly of Ratatosk corrected reads from an Ashkenazi individual yields a contig N50 of 45 Mbp and less misassemblies than a PacBio HiFi reads assembly.

Highlights

  • In Norse mythology, the squirrel Ratatöskr runs up and down the ash tree Yggdrasil, bearing envious words between the eagle at the top and the dragon at the bottom

  • While HG002 and HG003 present a risk of overfitting as the individuals are related, we show in the following that their variant calling accuracy is similar to the one of HG004 which was called with a model trained on an unrelated individual

  • We present Ratatosk, a hybrid error correction tool for noisy genomic long reads designed for accurate variant calling and assembly

Read more

Summary

Introduction

In Norse mythology, the squirrel Ratatöskr runs up and down the ash tree Yggdrasil, bearing envious words between the eagle at the top and the dragon at the bottom. Short read sequencing (SRS) has allowed for the accurate identification of small variants (SNPs and indels) in non-repetitive parts of the genome while long read sequencing (LRS) allows for the characterization of large and complex variations. Oxford Nanopore Technologies (ONT) and Pacific Bioscience (PacBio) are LRS platforms [1] that produce long sequence reads ranging from 103 to 106 bases with an error rate up to 15% [2]. The high error rate of LRS reads is in part compensated by their lengths which increase their mapping accuracy, making LRS suitable for numerous applications in all fields of genomics. LRS used at high coverage on a few individuals [3] or low-medium coverage at population scale [4] greatly improves the detection of structural variants (SVs) because the large size of ONT reads spans SV breakpoints. LRS reads can encompass large sections of highly repetitive regions in the human genome such as centromeres [5], telomeres [6], and tandem repeats [7]

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call