Abstract

Recent studies have shown that de novo assembled genomes can reveal many novel variants, complementing reference-based variant detection. To evaluate the accuracy of assembly-based variant calling and its dependence on read coverage, we simulated short reads containing more than 3 million single nucleotide variants (SNVs) from the whole human genome and compared SNV calling performance between the assembly-based and alignment-based approaches. We assessed the quality of the assembled contigs and found that a minimum of 30X short-read coverage was needed to ensure reliable SNV calling and to generate assembled contigs with good coverage of the genome and genes. In addition, the assembly-based approach had much lower recall and precision than the alignment-based approach, which recovered 99% of the introduced SNVs. We observed similar results with experimental reads for NA24385, an individual whose germline variants are well characterized. Although the assembly-based approach adds value for SNV detection, it carries a high risk of false discovery of novel SNVs. Further improvement of de novo assembly algorithms is needed to achieve good genome completeness with resolved haplotypes and high fidelity of the assembled sequences.

Highlights

  • Detection of genetic variants such as single nucleotide variants (SNVs), insertions and deletions (INDELs), and structural variants (SVs) is one of the major objectives of next-generation sequencing (NGS) in human genome research

  • By comparing SNVs called from alignment of assembled contigs and from alignment of reads to the “ground truth” (SNVs introduced into the template reference for simulation), we directly evaluated the performance of the two variant calling approaches

  • By aligning contigs to the reference genome, we investigated how the coverage of the genome, genes, and exons by the assembled contigs varied with the coverage of reads used in the de novo assembly
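The coverage measurement in the last highlight can be illustrated with a minimal sketch. It assumes that contig alignments and the regions of interest (a gene or an exon, for example) are given as half-open intervals on a single chromosome; the function names `merge_intervals` and `covered_fraction` are hypothetical and are not taken from the study's pipeline.

```python
from typing import Iterable, List, Tuple

# Half-open interval [start, end) on one chromosome (an assumption of this sketch).
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval when the new one overlaps or touches it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(region: Interval, alignments: Iterable[Interval]) -> float:
    """Fraction of `region` covered by at least one aligned contig."""
    region_start, region_end = region
    length = region_end - region_start
    if length <= 0:
        return 0.0
    covered = 0
    for start, end in merge_intervals(alignments):
        # Clip each merged alignment to the region before counting bases.
        overlap = min(end, region_end) - max(start, region_start)
        if overlap > 0:
            covered += overlap
    return covered / length


# Hypothetical example: a 100 bp region and three contig alignments.
fraction = covered_fraction((0, 100), [(0, 30), (20, 50), (80, 120)])
print(f"{fraction:.2f}")
```

Repeating this calculation for assemblies built at different read depths would give the kind of coverage-versus-depth comparison the highlight describes.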

Summary

Introduction

Detection of genetic variants such as SNVs, insertions and deletions (INDELs), and structural variants (SVs) is one of the major objectives of next-generation sequencing (NGS) in human genome research. Genetic variant calling is typically based on alignment of raw sequence reads against a reference genome. We simulated short reads from the whole human genome to compare the assembly-based and alignment-based calling approaches. By comparing SNVs called from alignment of assembled contigs and from alignment of reads to the “ground truth” (SNVs introduced into the template reference for simulation), we directly evaluated the performance of the two variant calling approaches. We repeated this analysis with read sets from whole genome sequencing (WGS) of NA24385, an individual whose genome was fully sequenced and analyzed by the Genome in a Bottle (GIAB) consortium. We concluded that although an assembly-based approach (with SOAPdenovo[2] as the assembly tool) might serve as a complementary method for SNV discovery, it produced many false SNVs and missed calls because of sequence differences between the two alleles in a diploid genome such as the human genome.
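The comparison against the ground truth comes down to counting true positives, false positives (false SNVs), and false negatives (missed calls). The following is a minimal sketch of that calculation, assuming each SNV is represented as a (chromosome, position, reference allele, alternate allele) tuple; the type and function names are illustrative, and this is not the evaluation code used in the study.

```python
from typing import NamedTuple, Set, Tuple

# An SNV keyed by (chromosome, 1-based position, reference allele, alternate allele).
Snv = Tuple[str, int, str, str]


class CallMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    recall: float
    precision: float


def evaluate_calls(truth: Set[Snv], called: Set[Snv]) -> CallMetrics:
    """Compare a call set with the ground-truth SNVs by exact match."""
    tp = len(truth & called)   # introduced SNVs that were recovered
    fp = len(called - truth)   # called SNVs that were never introduced
    fn = len(truth - called)   # introduced SNVs that were missed
    recall = tp / (tp + fn) if truth else 0.0
    precision = tp / (tp + fp) if called else 0.0
    return CallMetrics(tp, fp, fn, recall, precision)


# Hypothetical toy data: four introduced SNVs, three recovered, one false call.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "A"),
    ("chr2", 75, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "A"),
    ("chr3", 10, "A", "C"),
}
metrics = evaluate_calls(truth, called)
print(f"recall={metrics.recall:.2f} precision={metrics.precision:.2f}")
```

Exact matching on all four fields is a simplification: a call at the right position with the wrong alternate allele counts as both a false positive and a false negative.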
