Abstract

High-coverage sequencing allows the study of variants occurring at low frequencies within samples, but is susceptible to false-positives caused by sequencing error. Ion Torrent has a very low single nucleotide variant (SNV) error rate and has been employed for the majority of human papillomavirus (HPV) whole genome sequences. However, benchmarking of intrahost SNVs (iSNVs) has been challenging, partly due to limitations imposed by the HPV life cycle. We address this problem by deep sequencing three replicates for each of 31 samples of HPV type 18 (HPV18). Errors, defined as iSNVs observed in only one of three replicates, are dominated by C→T (G→A) changes, independently of trinucleotide context. True iSNVs, defined as those observed in all three replicates, instead show a more diverse SNV type distribution, with particularly elevated C→T rates in CCG context (CCG→CTG; CGG→CAG) and C→A rates in ACG context (ACG→AAG; CGT→CTT). Characterization of true iSNVs allowed us to develop two methods for detecting true variants: (1) VCFgenie, a dynamic binomial filtering tool which uses each variant's allele count and coverage instead of fixed frequency cut-offs; and (2) a machine learning binary classifier which trains eXtreme Gradient Boosting models on variant features such as quality and trinucleotide context. Each approach outperforms fixed-cut-off filtering of iSNVs, and performance is enhanced when both are used together. Our results provide improved methods for identifying true iSNVs in within-host applications across sequencing platforms, specifically using HPV18 as a case study.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.