Distinguishing low frequency mutations from RT-PCR and sequence errors in viral deep sequencing data.

Richard J Orton, Marco J Morelli, Caroline F Wright, Donald P King, Daniel T Haydon, David J Paton, David J King

doi:10.1186/s12864-015-1456-x

Abstract

Background: RNA viruses have high mutation rates and exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences. Next generation sequencing (NGS) is revolutionising the study of viral populations by enabling the ultra-deep sequencing of their genomes, and the subsequent identification of the full spectrum of variants within the population. Identification of low frequency variants is important for our understanding of mutational dynamics, disease progression, immune pressure, and for the detection of drug resistant or pathogenic mutations. However, the current challenge is to accurately model the errors in the sequence data and distinguish real viral variants, particularly those that exist at low frequency, from errors introduced during sequencing and sample processing, both of which can be substantial.

Results: We have created a novel set of laboratory control samples that are derived from a plasmid containing a full-length viral genome with extremely limited diversity in the starting population. One sample was sequenced without PCR amplification whilst the other samples were subjected to increasing amounts of RT and PCR amplification prior to ultra-deep sequencing. This enabled the level of error introduced by the RT and PCR processes to be assessed and minimum frequency thresholds to be set for true viral variant identification. We developed a genome-scale computational model of the sample processing and NGS calling process to gain a detailed understanding of the errors at each step, which predicted that RT and PCR errors are more likely to occur at some genomic sites than others. The model can also be used to investigate whether the number of observed mutations at a given site of interest is greater than would be expected from processing errors alone in any NGS data set. Given basic sample processing information and the site's coverage and quality scores, the model uses the fitted RT-PCR error distributions to simulate the number of mutations that would be observed from processing errors alone.

Conclusions: These data sets and models provide an effective means of separating true viral mutations from those erroneously introduced during sample processing and sequencing.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1456-x) contains supplementary material, which is available to authorized users.
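The site-level check described in the Results can be illustrated with a small Monte Carlo sketch. This is not the authors' genome-scale model: it assumes each read at a site is miscalled independently, derives the sequencing error probability from the Phred quality score, and uses a single hypothetical per-base `rtpcr_error_rate` in place of the fitted RT-PCR error distributions. The function names and all parameter values are illustrative assumptions.

```python
import random


def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def simulate_error_counts(quals, rtpcr_error_rate, n_sims=500, seed=0):
    """Simulate mismatch counts at one site arising from errors alone.

    quals: Phred score of each read base covering the site
           (so len(quals) is the site's coverage).
    rtpcr_error_rate: hypothetical per-base probability that a read
           carries an RT-PCR error (an assumption, not a fitted value).
    Returns one simulated mismatch count per simulation.
    """
    rng = random.Random(seed)
    # A read shows a mismatch if either process introduced an error
    # (the rare case of two errors cancelling out is ignored).
    probs = [
        1 - (1 - rtpcr_error_rate) * (1 - phred_to_error_prob(q))
        for q in quals
    ]
    return [sum(rng.random() < p for p in probs) for _ in range(n_sims)]


def empirical_p_value(observed, simulated):
    """Fraction of simulations with at least as many mismatches as observed."""
    exceed = sum(s >= observed for s in simulated)
    return (1 + exceed) / (1 + len(simulated))


# Example: a site covered by 2000 reads, all with quality score 30.
site_quals = [30] * 2000
sims = simulate_error_counts(site_quals, rtpcr_error_rate=1e-4)
p = empirical_p_value(observed=30, simulated=sims)
print(f"mean simulated mismatches: {sum(sims) / len(sims):.2f}, p = {p:.4f}")
```

A small empirical p-value suggests that the observed mismatch count is unlikely to arise from processing and sequencing errors alone under these assumptions. Because amplification errors made in early PCR cycles are copied into many reads, the independence assumption is likely to understate the spread of real error counts.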

Highlights

RNA viruses have high mutation rates and exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences.
One sample was ultra-deep sequenced on the Illumina platform without any reverse transcription (RT)-polymerase chain reaction (PCR) sample processing, whilst the other samples were either PCR or RT-PCR amplified prior to sequencing. This enabled the level of error introduced by the RT and PCR processes to be individually assessed and minimum frequency thresholds to be set for true viral variant identification.
The mutation spectrum from a real foot-and-mouth disease virus (FMDV) sample, obtained from a foot lesion of an infected cow, is included in Figure 4. This sample underwent the same processing as the RT-PCR sample.

Summary

Introduction

Mutation rates of RNA viruses are cited to be between 10⁻³ and 10⁻⁶ mutations per nucleotide per transcription cycle [1,2,3,4], and mutations can potentially be introduced every time a viral genome is replicated. As a result, these viruses exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences [5,6,7,8]. The massively parallel and high-throughput nature of next generation sequencing (NGS) platforms, combined with the relatively short genome of an RNA virus, enables the analysis of viral samples (which can contain billions of virions) to a very high depth. Such ultra-deep coverage of the genome enables the diversity of the whole viral population to be examined and subsequently compared between samples to investigate evolutionary events such as selection and bottlenecks. A current problem with the application of NGS platforms to viral population analysis is that true low frequency viral variants cannot be effectively distinguished from variants caused by errors during sample preparation or sequencing.
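To give a sense of scale for the cited mutation rates, the short calculation below converts a per-nucleotide rate into the expected number of mutations per genome copy. The genome length of 8,300 nucleotides is an assumption chosen to be roughly FMDV-sized; it is not stated in the text above, and the function names are illustrative.

```python
def expected_mutations_per_copy(rate, genome_length):
    """Expected number of mutations introduced in one genome copy."""
    return rate * genome_length


def prob_at_least_one_mutation(rate, genome_length):
    """Probability that one genome copy carries at least one new mutation,
    assuming each nucleotide mutates independently at the same rate."""
    return 1 - (1 - rate) ** genome_length


GENOME_LENGTH = 8300  # assumed length in nucleotides, not from the source

for rate in (1e-3, 1e-6):
    mean = expected_mutations_per_copy(rate, GENOME_LENGTH)
    prob = prob_at_least_one_mutation(rate, GENOME_LENGTH)
    print(f"rate={rate:g}: expected mutations={mean:.4f}, P(>=1)={prob:.4f}")
```

Under these assumptions, the upper end of the cited range implies several new mutations in nearly every genome copy, while the lower end implies that fewer than one copy in a hundred carries a new mutation.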


Journal: BMC Genomics	Publication Date: Mar 24, 2015
Citations: 101	License type: CC BY 4.0

