Abstract

Contamination of public genome databases by foreign sequences is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature, and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single detection method, which carries a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple detection methods while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.

Highlights

  • Genome contamination, defined here as the accidental inclusion of sequences from other organisms or the misclassification of sequences in public repositories, is a problem that has attracted much attention in the recent literature

  • It is well known that contamination of genome-scale datasets can lead to false conclusions, and such cases have been reported

  • By studying the phenomenon in Cyanobacteria, we have shown that different methods sometimes yield widely different estimates of the contamination level (Cornet et al, 2018)

Abbreviations: RefSeq, Reference Sequence Database; LCA, Last Common Ancestor; IMG, Integrated Microbial Genomes; NCBI, National Center for Biotechnology Information; GTDB, Genome Taxonomy Database

Summary

INTRODUCTION

Genome contamination, defined here as the accidental inclusion of sequences from other organisms or the misclassification of sequences in public repositories, is a problem that has attracted much attention in the recent literature (see, for instance, Kahlke and Ralph, 2018; Lu and Salzberg, 2018; Breitwieser et al, 2019; Low et al, 2019).

Some genomes belong to a taxon that is so rare in genome databases that they only match themselves, which is not allowed by the Physeter algorithm and leads to low levels of the expected organism (e.g., GCF_000226295.1), including 45 genomes tagged as “unclassified Bacteria” by the NCBI.

[...] “root”) with a low level of contamination (median 1.1%), whereas Physeter found high contamination levels (median 14.6%) for these 65 cases. To deal with those 107 problematic genomes, we re-ran Physeter using the GTDB taxonomy (Parks et al, 2018) as an alternative and let the tool determine the main organism itself, just like CheckM usually does (see Supplementary Table 1).

Biological traits such as sheath thickness or the abundance of co-living organisms can explain the nature of the contaminants and the fact that some taxa have a higher propensity for contamination, the latter being affected by uneven sampling of lifestyles in RefSeq (e.g., many clinical samples).
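
To make the idea of LCA-based labelling with self-match exclusion more concrete, the following is a minimal Python sketch. It is not Physeter's actual implementation: the function names (`lca`, `classify`, `contamination_pct`), the label strings, and the representation of a lineage as a root-first tuple of taxon names are all assumptions made for illustration. It shows why a genome that only matches itself yields no usable signal, and how a contamination percentage could be derived from per-sequence labels.

```python
from collections import Counter

def lca(lineages):
    """Longest common prefix of root-first lineages (a simple LCA)."""
    if not lineages:
        return ()
    prefix = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        prefix.append(ranks[0])
    return tuple(prefix)

def classify(hits, self_id, expected):
    """Label one query sequence from its database hits (hypothetical scheme).

    hits:     list of (genome_id, lineage) pairs
    self_id:  identifier of the genome under test; its own hits are ignored
    expected: lineage prefix of the expected taxon
    """
    lineages = [lin for gid, lin in hits if gid != self_id]
    if not lineages:
        return "unknown"        # the sequence only matched its own genome
    common = lca(lineages)
    n = min(len(common), len(expected))
    if common[:n] != expected[:n]:
        return "contaminant"    # LCA conflicts with the expected taxon
    if len(common) < len(expected):
        return "unclassified"   # LCA too shallow to decide either way
    return "self"

def contamination_pct(labels):
    """Percentage of sequences labelled as contaminant."""
    total = len(labels)
    return 100.0 * Counter(labels)["contaminant"] / total if total else 0.0
```

Under this assumed scheme, a genome from a taxon represented only by itself in the database would receive mostly "unknown" labels, which mirrors the low levels of the expected organism described above.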
