Abstract

Information-theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses of four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life, with clear boundaries between superkingdom domains and high subtree similarity for named taxa at lower levels (family through phylum). By grouping ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
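The two quantities the abstract relies on, k-mer sequence space coverage and pairwise Jaccard similarity, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`kmer_set`, `jaccard`, `coverage`) are hypothetical, coverage is assumed here to mean the fraction of the 4^k possible k-mers that a genome contains, and the sketch ignores reverse complements, which a real analysis would likely need to handle.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in seq, skipping any window
    that contains a non-ACGT character (e.g. N)."""
    seq = seq.upper()
    alphabet = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    }


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def coverage(kmers: set[str], k: int) -> float:
    """Assumed definition: fraction of the 4**k possible k-mers observed."""
    return len(kmers) / 4 ** k


# Toy sequences standing in for genomes (illustrative only).
genome_a = kmer_set("ACGTAC", 3)   # {ACG, CGT, GTA, TAC}
genome_b = kmer_set("ACGTTT", 3)   # {ACG, CGT, GTT, TTT}
similarity = jaccard(genome_a, genome_b)
space_covered = coverage(genome_a, 3)
```

Under this definition, short k-mers saturate the sequence space (nearly every genome contains every k-mer, so similarities approach 1), while very long k-mers are rarely shared, which is one way to see why intermediate lengths may discriminate best. A matrix of pairwise distances `1 - jaccard(...)` could then be passed to any standard hierarchical clustering routine.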

Highlights

  • Information theory, initially developed for the mathematical analysis of communication systems by Shannon [1], has been applied to molecular biology for decades

  • We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database

  • Taxonomic hierarchy information was retrieved from the National Center for Biotechnology Information (NCBI) taxonomy database [55] via the myTAI R package [56]; for each organism we retrieved all available labels for its taxonomic levels



Introduction

Information theory, initially developed for the mathematical analysis of communication systems by Shannon [1], has been applied to molecular biology for decades. Gatlin’s pioneering works in the late 1960s were the first to define life as an information processing system [2, 3]. The application of information theory to biological sequences, concomitant with developments in sequencing technology and computational processing, has been foundational to the burgeoning field of bioinformatics. Within this field, a significant area of investigation is naturally devoted to the genome, wherein all of the hereditary information necessary to build and maintain an organism is stored.
