An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome.

Antonio Ribeiro, Andrew J Flavell, David Marshall, Agnieszka Golicz, Iain Milne, Christine Anne Hackett, Micha Bayer, Gordon Stephen

doi:10.1186/s12859-015-0801-z

Abstract

Background: Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. We explored the relationship between FP SNPs and seven factors involved in mapping-based variant calling: quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency, and filtering of SNPs by read mapping quality and read depth. This resulted in 576 possible factor level combinations. We used error- and variant-free simulated reads to ensure that every SNP found was indeed a false positive.

Results: The variation in the number of FP SNPs generated ranged from 0 to 36,621 for the 120 million base pairs (Mbp) genome. All of the experimental factors tested had statistically significant effects on the number of FP SNPs generated, and there was a considerable amount of interaction between the different factors. Using a fragmented reference sequence led to a dramatic increase in the number of FP SNPs generated, as did relaxed read mapping and a lack of SNP filtering. The choice of reference assembler, mapper and variant caller also significantly affected the outcome. The effect of read length was more complex and suggests a possible interaction between mapping specificity and the potential for contributing more false positives as read length increases.

Conclusions: The choice of tools and parameters involved in variant calling can have a dramatic effect on the number of FP SNPs produced, with particularly poor combinations of software and/or parameter settings yielding tens of thousands in this experiment. Between-factor interactions make simple recommendations difficult for a SNP discovery pipeline, but the quality of the reference sequence is clearly of paramount importance. Our findings are also a stark reminder that it can be unwise to use the relaxed mismatch settings provided as defaults by some read mappers when reads are being mapped to a relatively unfinished reference sequence from e.g. a non-model organism in its early stages of genomic exploration.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0801-z) contains supplementary material, which is available to authorized users.

Highlights

Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost
To simplify the design of the experiment, we used only the 150 bp read length dataset for assembly. Our choice of this read length was based on two considerations: a) a large number of ongoing sequencing projects use Illumina HiSeq reads as their primary source of sequence, and the current maximum read length for this platform is 150 bp, and b) even projects involving the assembly of very large, complex genomes such as wheat [22] use reads of this length or shorter as their primary source of sequence
First and foremost, the quality of the reference sequence is of paramount importance

Summary

Introduction

Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. Both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. We explored the relationship between FP SNPs and seven factors involved in mapping-based variant calling: quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency, and filtering of SNPs by read mapping quality and read depth. Uptake of alternatives to mapping-based SNP discovery appears to have been slow, and the majority of projects currently still employ a mapping-based approach.
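Two of the seven factors, filtering of SNPs by read mapping quality and by read depth, can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the `SnpCall` record layout, the function names, the example calls and the threshold values are all hypothetical and chosen only for illustration. The one idea taken from the text is that, because the simulated reads are error- and variant-free, every call that survives filtering counts as a false positive.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SnpCall:
    """One candidate SNP. Hypothetical record layout, not a real VCF parser."""
    contig: str
    position: int
    read_depth: int       # number of reads covering the site
    mapping_quality: int  # representative mapping quality of those reads


def filter_calls(calls: Iterable[SnpCall],
                 min_mapq: int = 0,
                 min_depth: int = 0) -> List[SnpCall]:
    """Keep only calls that meet both (assumed) thresholds."""
    return [
        call for call in calls
        if call.mapping_quality >= min_mapq and call.read_depth >= min_depth
    ]


def count_false_positives(calls: Iterable[SnpCall],
                          min_mapq: int = 0,
                          min_depth: int = 0) -> int:
    """With error- and variant-free reads, every surviving call is an FP."""
    return len(filter_calls(calls, min_mapq, min_depth))


# Made-up calls for illustration only; these are not data from the study.
EXAMPLE_CALLS = [
    SnpCall("contig1", 100, read_depth=3, mapping_quality=10),
    SnpCall("contig1", 250, read_depth=12, mapping_quality=40),
    SnpCall("contig2", 75, read_depth=20, mapping_quality=5),
    SnpCall("contig2", 900, read_depth=8, mapping_quality=30),
]

if __name__ == "__main__":
    # Compare the FP count with no filtering against the two filters combined.
    print(count_false_positives(EXAMPLE_CALLS))
    print(count_false_positives(EXAMPLE_CALLS, min_mapq=20, min_depth=10))
```

Running the same count over each combination of thresholds gives one cell of a factorial comparison per setting, which is the kind of comparison the experiment makes across all of its factors.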

Journal: BMC Bioinformatics
Publication Date: Nov 11, 2015
Citations: 57
License type: cc-by
