Abstract

The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10–25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample datasets both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools, including genome assemblers, variant callers, and haplotyping algorithms. We present deep-coverage HiFi datasets for five complex samples, including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.

Highlights

  • Until recently, DNA sequencing technologies produced either short, highly accurate reads[1,2] or less-accurate long reads (tens to hundreds of kb at 75–90% accuracy)[3,4]

  • Accurate short reads are appropriate for germline[5] and somatic[6] variant detection, exome sequencing[7], liquid biopsy[8], non-invasive prenatal testing[9], and counting applications such as transcript profiling[10] or single-cell analysis[11]

  • To increase the utility of noisy long-read sequencing, several error correction methods have been devised that improve long-read accuracy by combining data either from multiple independent long-read molecules or from long- and short-read technologies[12,14]

Summary

Background & Summary

Until recently, DNA sequencing technologies produced either short, highly accurate reads (up to 300 bases at 99% accuracy)[1,2] or less-accurate long reads (tens to hundreds of kb at 75–90% accuracy)[3,4]. To increase the utility of noisy long-read sequencing, several error correction methods have been devised that improve long-read accuracy by combining data either from multiple independent long-read molecules or from long- and short-read technologies[12,14]. These error-corrected reads can be used for assembly or other downstream applications. HiFi sequencing has demonstrated superior assembly and haplotyping results for the human genome, as measured by contiguity and accuracy, when compared to traditional noisy long-read or short-read methods. The data released in this study cover a wide breadth of highly complex plant, animal, and microbial organisms and will provide a useful sequence resource, driving sequencing standards toward higher quality in the future[25].

