Abstract

Background: High-throughput whole genome sequencing facilitates investigation of minority virus sub-populations from virus positive samples. Minority variants are useful in understanding within and between host diversity, population dynamics and can potentially assist in elucidating person-person transmission pathways. Several minority variant callers have been developed to describe low frequency sub-populations from whole genome sequence data. These callers differ based on bioinformatics and statistical methods used to discriminate sequencing errors from low-frequency variants. Methods: We evaluated the diagnostic performance and concordance between published minority variant callers used in identifying minority variants from whole-genome sequence data from virus samples. We used the ART-Illumina read simulation tool to generate three artificial short-read datasets of varying coverage and error profiles from an RSV reference genome. The datasets were spiked with nucleotide variants at predetermined positions and frequencies. Variants were called using FreeBayes, LoFreq, Vardict, and VarScan2. The variant callers' agreement in identifying known variants was quantified using two measures; concordance accuracy and the inter-caller concordance. Results: The variant callers reported differences in identifying minority variants from the datasets. Concordance accuracy and inter-caller concordance were positively correlated with sample coverage. FreeBayes identified the majority of variants although it was characterised by variable sensitivity and precision in addition to a high false positive rate relative to the other minority variant callers and which varied with sample coverage. LoFreq was the most conservative caller. Conclusions: We conducted a performance and concordance evaluation of four minority variant calling tools used to identify and quantify low frequency variants. Inconsistency in the quality of sequenced samples impacts on sensitivity and accuracy of minority variant callers. Our study suggests that combining at least three tools when identifying minority variants is useful infiltering errors when calling low frequency variants.

Highlights

  • RNA viruses have been described as a population of closely related sequences that arise from rapid genomic evolution coupled with a high replication and mutation rates (Domingo et al, 2012; Eigen et al, 1988; Holland et al, 1992)

  • We explored the tools’ ability to detect and quantify minority variants and assessed their overall agreement which we defined using two metrics, concordance accuracy, which measures the combined accuracy of the variant callers, and inter-caller concordance, which is the size of the largest set of variant callers that agree at each position

  • Concordance accuracy improved with increase in sample coverage (Figure 2), and the proportion of positions that could not be identified by any variant caller decreased with increase in coverage (Supplementary Figure 1)

Read more

Summary

Introduction

RNA viruses have been described as a population of closely related sequences that arise from rapid genomic evolution coupled with a high replication and mutation rates (Domingo et al, 2012; Eigen et al, 1988; Holland et al, 1992). A number of methods that incorporate both genomic and epidemiologic data to infer pathogen transmission have recently been developed (Worby et al, 2017) These approaches rely partly on the accurate detection and quantification of minority variant populations from genomic samples. Several minority variant callers have been developed to describe low frequency sub-populations from whole genome sequence data. Methods: We evaluated the diagnostic performance and concordance between published minority variant callers used in identifying minority variants from whole-genome sequence data from virus samples. FreeBayes identified the majority of variants it was characterised by variable sensitivity and precision in addition to a high false positive rate relative to the other minority variant callers and which varied with sample coverage.

Objectives
Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call