Abstract
Metagenomics and single-cell genomics have revolutionized the study of microorganisms, increasing our knowledge of microbial genomic diversity by orders of magnitude. A major challenge with metagenome-assembled genomes (MAGs) and single-cell amplified genomes (SAGs) is estimating their completeness and redundancy. Most approaches rely on counting conserved marker genes. In miComplete, we introduce a weighting strategy in which the presence or absence of each marker is normalized by its median distance to the next marker in a set of complete reference genomes. This approach alleviates biases introduced by the presence or absence of short DNA fragments that contain many markers, e.g. ribosomal protein operons. miComplete is written in Python 3 and released under GPLv3. Source code and documentation are available at https://bitbucket.org/evolegiolab/micomplete. Supplementary data are available at Bioinformatics online.
Highlights
Advances in high-throughput sequencing have made large-scale sequencing projects increasingly affordable and accessible.
Genomes from uncultured microorganisms may be obtained by sorting cells on a flow cytometer, then amplifying and sequencing their DNA.
Markers are not uniformly distributed around prokaryotic chromosomes, and a certain amount of linkage is conserved even across long evolutionary distances (e.g. Rogozin et al., 2002; Lathe et al., 2000). This is especially important because commonly used marker sets often include ribosomal protein genes, which are organized in conserved operons: the presence or absence of ribosomal protein genes should contribute less to completeness and redundancy than that of non-clustered genes.
Summary
Advances in high-throughput sequencing have made large-scale sequencing projects increasingly affordable and accessible. In marker-counting approaches, the fraction of identified markers corresponds to genome completeness, while additional copies represent either contamination or redundancy (Rinke et al., 2013). This approach is implemented, for example, in CheckM (Parks et al., 2015) and BUSCO (Simão et al., 2015). However, markers are not uniformly distributed around prokaryotic chromosomes, and a certain amount of linkage is conserved even across long evolutionary distances (e.g. Rogozin et al., 2002; Lathe et al., 2000). This is especially important because commonly used marker sets often include ribosomal protein genes, which are organized in conserved operons: the presence or absence of ribosomal protein genes (or of any other markers that generally lie close to others) should contribute less to completeness and redundancy than that of non-clustered genes.
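The weighting idea can be illustrated with a minimal Python sketch. This is not miComplete's actual code: the function names, the input layout (a genome length plus a dictionary of marker start positions per reference genome), the treatment of chromosomes as circular, and the definition of redundancy as the weighted share of extra marker copies are all assumptions made for illustration. The marker names and coordinates are invented toy values.

```python
from collections import Counter
from statistics import median


def marker_weights(references):
    """Weight each marker by its median distance to the next marker.

    `references` is a list of (genome_length, {marker: start_position})
    tuples, one per complete reference genome (hypothetical layout).
    Chromosomes are assumed to be circular.
    """
    distances = {}
    for genome_length, positions in references:
        ordered = sorted(positions, key=positions.get)
        for i, marker in enumerate(ordered):
            nxt = ordered[(i + 1) % len(ordered)]
            # Distance to the next marker, wrapping around the origin.
            gap = (positions[nxt] - positions[marker]) % genome_length
            distances.setdefault(marker, []).append(gap / genome_length)
    medians = {m: median(d) for m, d in distances.items()}
    total = sum(medians.values())
    # Normalize so that a genome with every marker scores 1.0.
    return {m: v / total for m, v in medians.items()}


def weighted_scores(hits, weights):
    """Return (completeness, redundancy) for a list of marker hits.

    Completeness sums the weights of markers seen at least once;
    redundancy (as assumed here) sums the weights of extra copies.
    """
    counts = Counter(h for h in hits if h in weights)
    completeness = sum(weights[m] for m in counts)
    redundancy = sum(weights[m] * (n - 1) for m, n in counts.items())
    return completeness, redundancy


# Toy references: rpsA and rpsB are adjacent, recA lies far away.
references = [
    (1000, {"rpsA": 100, "rpsB": 110, "recA": 500}),
    (2000, {"rpsA": 200, "rpsB": 230, "recA": 1200}),
]
weights = marker_weights(references)

# A contig carrying only the two clustered markers.
completeness, redundancy = weighted_scores(["rpsA", "rpsB"], weights)
print(f"weighted completeness: {completeness:.3f}, unweighted: {2 / 3:.3f}")
```

In this toy case, a marker that is immediately followed by another marker receives a small weight, so finding it adds little to the completeness estimate, whereas a plain count treats all three markers equally.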