Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy

Yuval Bussi, Ziv Reich, Ruti Kapon

doi:10.1371/journal.pone.0258693

Abstract

Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
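The two quantities named in the abstract, k-mer sequence space coverage and pairwise Jaccard similarity, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names (`kmer_set`, `jaccard_similarity`, `sequence_space_coverage`) are hypothetical, and the sketch assumes that k-mers are taken from a single strand without reverse-complement canonicalization, that windows containing characters other than A, C, G, T are skipped, and that coverage means the fraction of the 4^k possible k-mers observed in a genome.

```python
def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in a sequence.

    Assumption: single strand only, and windows containing
    anything other than A, C, G, T are skipped.
    """
    seq = sequence.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }


def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def sequence_space_coverage(kmers: set[str], k: int) -> float:
    """Fraction of the 4**k possible k-mers that were observed."""
    return len(kmers) / 4 ** k


# Toy sequences standing in for genomes (hypothetical data).
genome_a = kmer_set("ACGTAC", 3)
genome_b = kmer_set("ACGTTT", 3)
similarity = jaccard_similarity(genome_a, genome_b)
coverage_a = sequence_space_coverage(genome_a, 3)
```

For real genomes at k = 21 or 31, the number of possible k-mers (4^k) far exceeds genome length, so coverage is tiny and shared k-mers are informative; at small k nearly every k-mer occurs in every genome and the Jaccard similarity saturates. A matrix of pairwise values of this kind could then be converted to distances (1 - similarity) and passed to a hierarchical clustering routine.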

Highlights

Information theory, initially developed for the mathematical analysis of communication systems by Shannon [1], has been applied to molecular biology for decades
We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database
Taxonomic hierarchy information was retrieved from the National Center for Biotechnology Information (NCBI) taxonomy database [55] via the myTAI R package [56]; for each organism we retrieved all available labels for its taxonomic levels

Summary

Introduction

Information theory, initially developed for the mathematical analysis of communication systems by Shannon [1], has been applied to molecular biology for decades. Gatlin’s pioneering works in the late 1960s were the first to define life as an information processing system [2, 3]. The application of information theory to biological sequences, concomitant with developments in sequencing technology and computational processing, has been foundational to the burgeoning field of bioinformatics. Within this field, a significant area of investigation is naturally devoted to the genome, wherein all of the hereditary information necessary to build and maintain an organism is stored.

Journal: PLOS ONE
Publication Date: Oct 14, 2021
Citations: 26
License type: CC BY 4.0
