Abstract
Background: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries.

Results: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole-genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332× and 823×, and assembly quality worsened if it increased to >1,000× for a given C. Long DNA fragments could significantly extend phase blocks but decreased contig contiguity. The optimal length-weighted fragment length ($W\mu_{FL}$) was ∼50–150 kb. When broadly optimal parameters were used for library preparation and sequencing, ∼80% of the genome was assembled in a diploid state.

Conclusions: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.
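The relationship between the three coverage parameters, and the meaning of a length-weighted fragment length, can be illustrated with a minimal Python sketch. It rests on two assumptions that this abstract does not state explicitly: that total coverage factors as C = CF × CR, and that the length-weighted fragment length is the mean of fragment lengths weighted by length, sum(L²)/sum(L). The function names and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
def read_coverage_per_fragment(total_coverage, physical_coverage):
    """Return CR for a given C and CF, assuming C = CF * CR."""
    if physical_coverage <= 0:
        raise ValueError("physical coverage must be positive")
    return total_coverage / physical_coverage


def length_weighted_mean(fragment_lengths):
    """Return sum(L^2) / sum(L): each fragment is weighted by its own length."""
    total = sum(fragment_lengths)
    if total <= 0:
        raise ValueError("need at least one fragment of positive length")
    return sum(length * length for length in fragment_lengths) / total


# Illustrative values only (not taken from the study):
# at C = 60x and CF = 600x, each fragment is covered at CR = 0.1x.
cr = read_coverage_per_fragment(60, 600)

# Two fragments of 50 kb and 150 kb: the plain mean is 100 kb,
# while the length-weighted mean is pulled toward the longer fragment.
weighted_kb = length_weighted_mean([50, 150])
```

Under the assumed identity, raising CF at a fixed C necessarily lowers CR, which is why the two parameters are discussed as a trade-off at constant C.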
Highlights
The human genome holds the key for understanding the genetic basis of human evolution, hereditary illnesses and many phenotypes
The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing
For scaffolds in the real data sets, when C increased from 48× (R3) to 67× (R4), both scaffold N50 and NA50 were significantly improved (N50: 13.4 Mb to 30.6 Mb; NA50: 6.3 Mb to 12.0 Mb), but the accuracy dropped slightly from 46.6% to 39.1%, which indicated that scaffold accuracy may be refractory to extremely high C (Figure 2F). These results indicated that assembly length and accuracy were comparable over a broad range of coverage of DNA fragments (CF) and coverage per fragment (CR) at constant C, which implied that assembly quality was mainly determined by C
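For readers unfamiliar with the contiguity metric quoted above, the following is a minimal sketch of the standard N50 definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. It is not the study's evaluation pipeline, and the function name `n50` is an assumption made here. NA50 is usually computed the same way, but on aligned blocks obtained after breaking scaffolds at misassemblies; that alignment step is not shown.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the cumulative
        # length to at least half of the assembly.
        if running >= half_total:
            return length


# Toy assembly with lengths in Mb (hypothetical values):
# total = 20, and 6 + 5 = 11 is the first cumulative sum >= 10.
toy_n50 = n50([2, 3, 4, 5, 6])
```

Because N50 depends only on the length distribution, it can rise while accuracy falls, which is consistent with reporting both N50 and an alignment-based measure side by side.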