Abstract
Advances in next generation sequencing (NGS) and in the associated analysis tools have made both increasingly popular in scientific research and clinical diagnostics. A systematic comparison of sequencing platforms and variant calling pipelines can therefore provide significant guidance for NGS-based scientific and clinical genomics. In this study, we compared the performance, concordance and operating efficiency of 27 combinations of sequencing platforms and variant calling pipelines. We tested three variant calling pipelines (Genome Analysis Toolkit HaplotypeCaller, Strelka2 and Samtools-Varscan2) on nine data sets for the NA12878 genome, sequenced on different platforms including BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. Among the 12 combinations evaluated on WES datasets, all performed well in calling SNPs, with F-scores higher than 0.96 in every case, while their F-scores for calling INDELs varied from 0.75 to 0.91. All 15 combinations evaluated on WGS datasets also performed well, with F-scores for calling SNPs higher than 0.975 in every case and F-scores for calling INDELs varying from 0.71 to 0.93. All of these combinations showed high concordance in variant identification, although the divergence in variant identification was larger in the WGS datasets than in the WES datasets. We also down-sampled the original WES and WGS datasets to a series of coverage levels across multiple platforms and recorded the time each of the three pipelines took for variant calling at each coverage level. For the GIAB datasets on both BGI and Illumina platforms, Strelka2 showed superior detection accuracy and processing efficiency compared with the other two pipelines on each sequencing platform, and we therefore recommend it for the further promotion and application of next generation sequencing technology.
The results of our research provide useful and comprehensive guidelines for individual researchers and research organizations seeking reliable and consistent variant identification.
Highlights
Revolutionary next generation sequencing (NGS) technologies have remarkably decreased the cost of genome sequencing and promoted wider adoption of the technology, leading to notable achievements in genome sequencing projects such as the 1000 Genomes Project[1] and the HapMap project[2,3]
After quality control, 0.21%, 1.28%, 1.76%, 7.29% and 8.25% of reads were removed as low quality from the MGISEQ2000, NovaSeq, BGISEQ500, HiSeq Xten and HiSeq4000 whole-genome sequencing (WGS) datasets, respectively
We compared the distributions of F-scores of each combination for single-nucleotide polymorphisms (SNPs) and insertions and deletions (INDELs); the F-score is more informative than precision or sensitivity alone (Fig. 2B)
Summary
Revolutionary next generation sequencing (NGS) technologies have remarkably decreased the cost of genome sequencing and promoted wider adoption of the technology, leading to notable achievements in genome sequencing projects such as the 1000 Genomes Project[1] and the HapMap project[2,3]. To draw conclusions that generalize to genomes from different individual samples, variant calling pipelines need to be evaluated on multiple datasets generated by different sequencing platforms. Previous studies benchmarked pipelines using Precision and Recall, or True-Positive (TP) and False-Positive (FP) counts. These measures do not adequately reflect the intrinsic trade-off between Precision and Recall, so more informative measures such as the F-score or the area under a precision-recall curve (APR) are needed. We aimed to systematically evaluate the accuracy and efficiency of different sequencer and pipeline combinations for small variants, to identify the most precise and efficient combinations, to define the optimal variant calling pipeline for each sequencing platform according to its performance, concordance and time consumption, and to provide useful guidelines on reliable variant identification for individual researchers and research organizations working in genome sequencing.
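To make the trade-off between Precision and Recall concrete, the following is a minimal sketch of how an F-score can be computed from true-positive, false-positive and false-negative counts. It assumes the standard definition (the harmonic mean of Precision and Recall when `beta` is 1); the text above does not state which exact variant the study used. The `CallCounts` class, the function names and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCounts:
    """Counts from comparing a call set against a truth set (hypothetical container)."""
    tp: int  # variants called and present in the truth set
    fp: int  # variants called but absent from the truth set
    fn: int  # truth-set variants that were not called


def precision(c: CallCounts) -> float:
    """Fraction of called variants that are correct."""
    called = c.tp + c.fp
    return c.tp / called if called else 0.0


def recall(c: CallCounts) -> float:
    """Fraction of truth-set variants that were called (sensitivity)."""
    truth = c.tp + c.fn
    return c.tp / truth if truth else 0.0


def f_score(c: CallCounts, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Illustrative counts only, not results from the study.
example = CallCounts(tp=90, fp=10, fn=30)
print(f"precision={precision(example):.3f} "
      f"recall={recall(example):.3f} "
      f"F1={f_score(example):.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a pipeline cannot reach a high F-score by maximizing Precision at the expense of Recall, or the reverse, which is why a single F-score is easier to compare across combinations than the two measures taken separately.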