Abstract

We present a method for clustering genomic sequences based on variations in local entropy. We analyzed the distributions of block entropies in viral and plant genomes and observed distinct patterns for the two groups. These distributions, which describe the local entropic variability of the genomes, are used to cluster the genomes by their Jensen-Shannon (JS) distances. An analysis of the JS distances between the genomes of all viruses that infect chlorella algae shows the host specificity of the viruses. We illustrate the efficacy of this entropy-based clustering technique by segregating plant and virus genomes into separate bins.
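The pipeline described in the abstract (block entropies, their distribution, JS distance between distributions) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the non-overlapping blocks, the block size of 64, the 20-bin histogram, and the use of the square root of the base-2 JS divergence as the "distance" are all assumptions made here for the example.

```python
import math
from collections import Counter

def shannon_entropy(block):
    """Shannon entropy (bits) of the symbol frequencies in one block."""
    n = len(block)
    counts = Counter(block)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def block_entropy_distribution(seq, block_size=64, bins=20):
    """Normalized histogram of entropies of non-overlapping blocks.

    Assumption: a 4-letter alphabet, so entropies lie in [0, 2] bits;
    values outside that range are clipped into the last bin.
    """
    n_blocks = len(seq) // block_size
    if n_blocks == 0:
        raise ValueError("sequence is shorter than one block")
    hist = [0] * bins
    for i in range(n_blocks):
        h = shannon_entropy(seq[i * block_size:(i + 1) * block_size])
        idx = min(int(h / 2.0 * bins), bins - 1)
        hist[idx] += 1
    return [x / n_blocks for x in hist]

def js_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence (assumed definition)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))
```

A pairwise matrix of `js_distance` values over all genomes could then be handed to any standard clustering routine; which routine the authors used is not stated in this summary.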

Highlights

  • The organization of genomes has evolved dynamically by a stochastic process comprised of mutation and selection

  • We show that local variations in entropy are very useful for clustering viral and plant genomes and may also suggest host specificity of viruses

  • We have proposed a novel method for clustering genomic sequences based on variations in local entropy

Summary

Introduction

The organization of genomes has evolved dynamically through a stochastic process of mutation and selection. At the nucleotide level, the sequence entropy of a full genome reduces to the question of CG-content [1,2]. This is a global property related to the mutation rate, whereas selective advantage reveals itself locally, e.g., in regions coding for gene products such as proteins.
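To see why the global entropy carries little beyond CG-content, consider a short sketch. It assumes, purely for illustration, that the two strands are compositionally symmetric, so that p(C) = p(G) and p(A) = p(T); the single-nucleotide entropy is then a function of the CG fraction alone. The function name `entropy_from_cg` is hypothetical.

```python
import math

def entropy_from_cg(cg):
    """Single-nucleotide entropy in bits for a given CG fraction.

    Assumption: p(C) = p(G) = cg / 2 and p(A) = p(T) = (1 - cg) / 2.
    """
    probs = [cg / 2, cg / 2, (1 - cg) / 2, (1 - cg) / 2]
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

Under this assumption the entropy peaks at 2 bits for a CG fraction of 0.5 and falls toward 1 bit as the composition becomes extreme, which is why a local, block-wise measure is needed to see anything more than composition.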

Superinformation Revisited
Clustering of Genomic Sequences
Findings
Conclusions