Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.

Arthur W Pightling, Nicholas Petronella, Franco Pagotto, Andrew R Dalby

doi:10.1371/journal.pone.0104579

Open Access

https://doi.org/10.1371/journal.pone.0104579

Journal: PLoS ONE
Publication Date: Aug 21, 2014
Citations: 103
License type: CC BY 4.0

Affiliation: Health Canada

Abstract

The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should test a variety of conditions to achieve optimal results.
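
The error categories named in the abstract (true SNPs, false positive SNPs, ambiguously called sites, and gaps) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `score_sites`, the use of `N` for ambiguous calls and `-` for gaps, and the assumption that the reference, the known subject genome, and the called consensus are already aligned to the same length are all choices made here for illustration.

```python
from collections import Counter


def score_sites(reference: str, truth: str, called: str) -> Counter:
    """Classify each aligned site of a called consensus sequence.

    reference: the reference genome used for the assembly
    truth:     the known subject genome (hypothetical input; in a
               benchmark this is the genome the reads were derived from)
    called:    the consensus produced by the assembly, with 'N' for an
               ambiguous call and '-' for a gap (assumed conventions)
    """
    if not (len(reference) == len(truth) == len(called)):
        raise ValueError("all three sequences must be aligned to the same length")

    counts = Counter(true_snp=0, false_positive=0, missed_snp=0,
                     ambiguous=0, gap=0)
    for ref_base, true_base, call in zip(reference, truth, called):
        if call == "-":
            counts["gap"] += 1
        elif call == "N":
            counts["ambiguous"] += 1
        elif call != ref_base:
            # A SNP was called; it is a true SNP only if it matches the truth.
            if call == true_base:
                counts["true_snp"] += 1
            else:
                counts["false_positive"] += 1
        elif true_base != ref_base:
            # The call agrees with the reference, but a real difference exists.
            counts["missed_snp"] += 1
    return counts


if __name__ == "__main__":
    result = score_sites("ACGTACGTA", "ACCTACGAA", "ACCTNC-TG")
    print(dict(result))
```

Comparing these counts across assemblers, references, and coverage depths is the kind of tabulation a benchmarking experiment like the one described would need, although the study's actual scoring rules may differ.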

Highlights

Comprehensive sequencing and analysis of bacterial genomes are increasingly valuable tools in fields such as epidemiology [1,2,3], population genetics [4,5], and experimental evolution [6].
We assessed the efficacy of four commonly used reference-guided short-read sequence assemblers (BWA, MOSAIK, Novoalign, and SMALT) to generate alignments suitable for accurate detection of single-nucleotide polymorphisms (SNPs) using both simulated reads and actual reads obtained from sequencing runs of Listeria monocytogenes strain 08-5578 genomic DNA on an Illumina MiSeq benchtop machine.
Increased accessibility of whole-genome sequence data, an abundance of open-source short-read sequence assembly software, and the proven utility of SNP detection in a number of fields require that factors that can influence the quality of assemblies and confidence in SNP calling be carefully considered.

Summary

Introduction

Comprehensive sequencing and analysis of bacterial genomes are increasingly valuable tools in fields such as epidemiology [1,2,3], population genetics [4,5], and experimental evolution [6]. SNP analyses can be performed with de novo assemblies. However, assemblies performed against references often yield more data than de novo assemblies, especially when sequence coverage is low [13]. Although inaccuracies in reference-guided short-read sequence alignments may arise from inherent errors associated with a given sequencing technology or from the quality of DNA extractions and library preparations, they are more likely to arise from misassembled reads [14], especially if appropriate pre- and post-processing of reads, such as read-quality trimming and filtering and local realignment around indels, has been performed [11,15,16]. The genetic distance between reference and subject sequences is also likely to affect SNP detection, as more distant references may pose additional challenges for reference-guided assemblers [12].
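
To make the notion of genetic distance between a reference and a subject sequence concrete, the following Python sketch computes a simple percent nucleotide difference over two pre-aligned sequences. This is an assumption about how such a figure could be obtained, not necessarily how the authors derived their distances; the function name `percent_distance` and the rule of skipping any site where either sequence has a gap or non-ACGT character are illustrative choices.

```python
def percent_distance(seq_a: str, seq_b: str) -> float:
    """Percent of compared sites at which two aligned sequences differ.

    Sites where either sequence has a gap or an ambiguous base
    (anything other than A, C, G, or T) are skipped, which is an
    assumption made for this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differences += 1

    if compared == 0:
        raise ValueError("no comparable sites")
    return 100.0 * differences / compared


if __name__ == "__main__":
    # One difference over ten comparable sites.
    print(percent_distance("ACGTACGTAC", "ACGTACGTAA"))
```

On a whole-genome alignment, a value under 1% would correspond to a closely related reference, and larger values to more distant references.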

Methods

Results

Conclusion