Abstract

Background: With the rapid development of long-read sequencing technologies, it is now possible to reveal the full spectrum of genetic structural variation (SV). However, the high cost, finite read length, and high error rate of long-read data greatly limit the widespread adoption of SV calling. Guidance on sequencing coverage, read length, and error rate is therefore urgently needed to maintain high SV yields at the lowest possible cost.

Results: In this study, we generated a full range of simulated error-prone long-read datasets covering various sequencing settings and comprehensively evaluated SV calling performance with state-of-the-art long-read SV detection methods. The benchmark results demonstrate that almost all SV callers perform better when the long-read data reach 20× coverage, a 20 kbp average read length, and error rates of approximately 10–7.5% or below 1%. Furthermore, sequencing coverage is the most influential factor in SV calling performance, and it also directly determines sequencing cost.

Conclusions: Based on the comprehensive evaluation results, we provide guidelines for selecting long-read sequencing settings for efficient SV calling. We believe these recommended settings will help guide cutting-edge genomic studies and clinical practice.

Highlights

  • With the rapid development of long-read sequencing technologies, it is possible to reveal the full spectrum of genetic structural variation (SV)

  • We provide recommendations on long-read sequencing coverage, mean read length, and error rate that balance sequencing economy against the effectiveness of SV detection. These recommendations are intended to guide future research on SV detection based on long-read sequencing

  • Which sequencing setting is the most influential in SV calling? To determine which sequencing attributes most strongly affect SV calling performance, we drew a heatmap of F1 scores across the sequencing settings, tools, and SV detection targets in this study
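The F1 score named in the last highlight is the harmonic mean of precision and recall. As a minimal sketch of how a grid of F1 values for such a heatmap could be assembled, assuming each (tool, setting) cell has already been reduced to true-positive, false-positive, and false-negative SV counts: the names `f1_score`, `build_f1_grid`, `caller_A`, and `caller_B` are hypothetical, and the counts are made-up placeholders, not results from the study.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def build_f1_grid(
    counts: Dict[Tuple[str, str], Counts]
) -> Dict[Tuple[str, str], float]:
    """Map each (tool, setting) cell to its F1 score."""
    return {key: f1_score(*cell) for key, cell in counts.items()}


# Placeholder counts for illustration only (NOT data from the paper).
example_counts: Dict[Tuple[str, str], Counts] = {
    ("caller_A", "10x"): (60, 20, 40),
    ("caller_A", "20x"): (80, 20, 20),
    ("caller_B", "10x"): (50, 10, 50),
    ("caller_B", "20x"): (75, 15, 25),
}

grid = build_f1_grid(example_counts)
for (tool, setting), score in sorted(grid.items()):
    print(f"{tool}\t{setting}\t{score:.3f}")
```

Each value in `grid` corresponds to one heatmap cell. Comparing cells along one axis (for example, coverage) while holding the others fixed is one way to see which setting moves the F1 score most.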


Summary

Introduction

With the rapid development of long-read sequencing technologies, it is now possible to reveal the full spectrum of genetic structural variation (SV). Because long reads span long genomic ranges and therefore map more reliably, it is possible to collect variant evidence across tens to thousands of kilobases [14] and to discover large and complex SVs in repetitive genomic regions [15]. With these advancements, long-read sequencing technologies have become the most effective tool for revealing the full spectrum of genetic variation, improving the understanding of mutation and evolutionary processes, resolving some of the missing heritability, and helping to discover novel biological insights [16].

