Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics.

Valérian Lupo, Hervé Vanderschuren, Frédéric Kerff, Denis Baurain, Mick Van Vlierberghe, Luc Cornet

doi:10.3389/fmicb.2021.755101

Open Access

https://doi.org/10.3389/fmicb.2021.755101

Abstract

Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature, and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single detection method, which carries a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.
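The k-folds idea mentioned in the abstract can be illustrated with a minimal sketch. This is not Physeter's implementation: it assumes, purely for illustration, that the reference set is split into k folds, that the contamination estimate is recomputed k times with one fold withheld each time, and that the runs are summarized by their median. The names `k_fold_estimates`, `summarize`, and `toy_estimate`, and all reference identifiers, are hypothetical.

```python
import random
from statistics import median


def k_fold_estimates(reference_ids, estimate_fn, k=10, seed=0):
    """Run estimate_fn k times, withholding one fold of references each time.

    estimate_fn is a hypothetical callable that takes the list of references
    still available and returns a contamination percentage.
    """
    ids = list(reference_ids)
    if k < 2 or k > len(ids):
        raise ValueError("k must be between 2 and the number of references")
    random.Random(seed).shuffle(ids)          # random but reproducible split
    folds = [ids[i::k] for i in range(k)]     # k disjoint folds covering ids
    estimates = []
    for withheld in folds:
        hidden = set(withheld)
        kept = [r for r in ids if r not in hidden]
        estimates.append(estimate_fn(kept))
    return estimates


def summarize(estimates):
    """Summarize the k runs by their median (an assumed choice)."""
    return median(estimates)


# Toy stand-in for a real detector: the estimate depends on whether one
# particular (hypothetical) reference is available in the run.
def toy_estimate(kept_refs):
    return 20.0 if "ref_3" in kept_refs else 2.0


refs = [f"ref_{i}" for i in range(10)]
runs = k_fold_estimates(refs, toy_estimate, k=5, seed=1)
print(sorted(runs), summarize(runs))
```

In this toy setting, the spread between runs shows that the estimate hinges on a single reference, which is the kind of dependence a resampling scheme is meant to expose.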

Highlights

Genome contamination, defined here as the accidental inclusion of sequences from other organisms or the misclassification of sequences in public repositories, is a problem that has attracted much attention in the recent literature.
It is well known that contamination of genome-scale datasets can lead to false conclusions, and such cases have been reported.
Abbreviations: RefSeq, Reference Sequence Database; LCA, Last Common Ancestor; IMG, Integrated Microbial Genomes; NCBI, National Center for Biotechnology Information; GTDB, Genome Taxonomy Database.
By studying the phenomenon in Cyanobacteria, we have shown that different methods sometimes yield widely different estimates of the contamination level (Cornet et al., 2018).

Summary

INTRODUCTION

Genome contamination, defined here as the accidental inclusion of sequences from other organisms or the misclassification of sequences in public repositories, is a problem that has attracted much attention in the recent literature (see for instance, Kahlke and Ralph, 2018; Lu and Salzberg, 2018; Breitwieser et al., 2019; Low et al., 2019).

Some genomes belong to a taxon that is so rare in genome databases that they only match themselves, which is not allowed by the Physeter algorithm and leads to low levels of the expected organism (e.g., GCF_000226295.1), including 45 genomes tagged as “unclassified Bacteria” by the NCBI.

[...] “root”) with a low level of contamination (median 1.1%), whereas Physeter found high contamination levels (median 14.6%) for these 65 cases.

To deal with those 107 problematic genomes, we re-ran Physeter using the GTDB taxonomy (Parks et al., 2018) as an alternative and let the tool determine the main organism itself, just like CheckM usually does (see Supplementary Table 1).

Biological traits like sheath thickness or the abundance of coliving organisms can explain the nature of the contaminants and the fact that some taxa have a higher propensity for contamination, the latter being affected by uneven sampling of lifestyles in RefSeq (e.g., lots of clinical samples).
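The case for running more than one detector can be sketched as a simple cross-check that flags genomes on which two methods disagree. The sketch below is only an illustration: the 5% threshold, the `Estimate` record, the `flag_discordant` function, and the accessions and percentages are all assumptions, not values or code from the paper, and neither CheckM nor Physeter is actually called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    accession: str          # hypothetical genome identifier
    marker_pct: float       # estimate from a marker-gene method (CheckM-like)
    genome_wide_pct: float  # estimate from a genome-wide method (Physeter-like)


def flag_discordant(estimates, threshold_pct=5.0):
    """Return accessions that exactly one of the two methods calls contaminated.

    threshold_pct is an assumed cut-off chosen for illustration only.
    """
    flagged = []
    for e in estimates:
        marker_hit = e.marker_pct >= threshold_pct
        genome_hit = e.genome_wide_pct >= threshold_pct
        if marker_hit != genome_hit:  # the methods disagree
            flagged.append(e.accession)
    return flagged


# Made-up values for three hypothetical genomes.
table = [
    Estimate("GENOME_A", marker_pct=0.4, genome_wide_pct=0.9),
    Estimate("GENOME_B", marker_pct=1.1, genome_wide_pct=14.6),
    Estimate("GENOME_C", marker_pct=9.0, genome_wide_pct=12.0),
]
discordant = flag_discordant(table)
print(discordant)
```

Genomes flagged this way are candidates for manual inspection; agreement between the two methods does not by itself prove that a genome is clean.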

DISCUSSION

METHODS

Findings

DATA AVAILABILITY STATEMENT

Journal: Frontiers in Microbiology	Publication Date: Oct 22, 2021
Citations: 26	License type: CC BY 4.0
