Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines.

Stephen J Bush, Derrick W Crook, Tim E A Peto, David W Eyre, Emily L Clark, Dona Foster, Liam P Shaw, Nicola De Maio, Nicole Stoesser, A Sarah Walker

doi:10.1093/gigascience/giaa007

Abstract

Background: Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella.

Results: We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis.

Conclusions: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
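The abstract relates pipeline sensitivity and precision to the Mash distance between reads and reference. As a rough illustration of how these quantities can be computed, the sketch below assumes the standard definitions (sensitivity = TP / (TP + FN), precision = TP / (TP + FP)) and the usual Mash formula, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of two k-mer sets. The function names, the (position, alternate base) representation of a SNP, and the values in the usage note are hypothetical and are not taken from the study.

```python
import math


def sensitivity_and_precision(true_snps, called_snps):
    """Compare a set of called SNPs with a truth set.

    Both arguments are sets of (position, alt_base) tuples; this
    representation is an assumption made for illustration only.
    """
    true_positives = len(true_snps & called_snps)
    false_negatives = len(true_snps - called_snps)
    false_positives = len(called_snps - true_snps)

    # Guard against empty sets so the ratios are always defined.
    sensitivity = (
        true_positives / (true_positives + false_negatives)
        if true_snps else float("nan")
    )
    precision = (
        true_positives / (true_positives + false_positives)
        if called_snps else float("nan")
    )
    return sensitivity, precision


def mash_distance(jaccard, k):
    """Mash distance from a Jaccard index j of two k-mer sets.

    Mash itself estimates j from MinHash sketches; here j is simply
    passed in as a number.
    """
    if jaccard <= 0:
        return 1.0  # no shared k-mers: maximal distance by convention
    if jaccard >= 1:
        return 0.0  # identical k-mer sets
    return -math.log(2 * jaccard / (1 + jaccard)) / k
```

For example, with a truth set of four SNPs of which two are recovered alongside one spurious call, `sensitivity_and_precision` gives a sensitivity of 0.5 and a precision of 2/3. The distance returned by `mash_distance` grows as the Jaccard index falls, which is the sense in which it serves as a proxy for nucleotide divergence.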

Highlights

Identifying single-nucleotide polymorphisms (SNPs) from bacterial DNA is essential for monitoring outbreaks and predicting phenotypes, such as antimicrobial resistance [3]. The pipeline selected for this task strongly affects the outcome [4].
We have compared SNP-calling pipelines across both simulated and real data in multiple bacterial species, allowing us to benchmark their performance for this specific use.
We find that all pipelines show extensive species-specific variation in performance, which has not been apparent from the majority of existing, human-centred benchmarking studies.

Summary

Introduction

Identifying single-nucleotide polymorphisms (SNPs) from bacterial DNA is essential for monitoring outbreaks (as in [1, 2]) and predicting phenotypes, such as antimicrobial resistance [3], and the pipeline selected for this task strongly affects the outcome [4]. Reference-based mapping approaches use a known reference genome to guide this process, combining an aligner, which identifies the location in the genome from which each read is likely to have arisen, with a variant caller, which summarizes the available information at each site to identify variants including SNPs and indels (see reviews for an overview of alignment [5, 6] and SNP-calling [7] algorithms). This evaluation focuses only on SNP calling; we did not evaluate indel calling because this can require different algorithms (see review [8]). When reads were aligned to divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
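To make the role of the variant caller concrete, the following is a deliberately simplified sketch of summarizing the aligned bases at each site and reporting a SNP where the majority base differs from the reference. It is not the method of any pipeline evaluated here: real callers typically use statistical models that also weigh base and mapping qualities. The function name, the pileup representation, and both thresholds are hypothetical choices made for illustration.

```python
from collections import Counter


def call_snps(reference, pileup, min_depth=5, min_fraction=0.9):
    """Toy majority-vote SNP caller.

    reference    : reference sequence as a string
    pileup       : dict mapping a 0-based position to the string of read
                   bases aligned at that position (assumed representation)
    min_depth    : minimum number of reads required to make a call
    min_fraction : minimum fraction of reads supporting the alternate base
    """
    snps = []
    for pos, bases in sorted(pileup.items()):
        if len(bases) < min_depth:
            continue  # too little evidence at this site
        base, count = Counter(bases).most_common(1)[0]
        if base != reference[pos] and count / len(bases) >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps
```

Calling `call_snps("ACGTACGT", {2: "TTTTTT", 4: "AAAAAA", 6: "TT"})` reports a single G-to-T SNP at position 2: position 4 matches the reference and position 6 is below the depth threshold. The sketch also hints at why reference choice matters, since reads from a divergent strain may fail to align at all, leaving sites with no evidence to summarize.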

Methods

Results

Discussion

Conclusion