Abstract
Background
Short- and long-range correlations in biological sequences are central to genomic studies of covariation. These correlations can be studied using mutual information, which measures the amount of information one random variable contains about another. Here we present MIA (Mutual Information Analyzer), a user-friendly graphical-interface pipeline that calculates spectra of vertical entropy (VH), vertical mutual information (VMI) and horizontal mutual information (HMI), since there is currently no user-friendly integrated platform that performs all these calculations in a single package. MIA also calculates the Jensen-Shannon Divergence (JSD) between pairs of spectra from different species, herein called informational distances. The resulting distance matrices can be presented as distance histograms and informational dendrograms, supporting the discrimination of closely related species.
Results
To test MIA, we analyzed sequences from the Drosophila Adh locus, because the taxonomy and evolutionary patterns of different Drosophila species are well established and the Adh gene has been extensively studied. The search retrieved 959 sequences from 291 species, of which 450 sequences from 17 species were selected. With this dataset, MIA performed all tasks in less than three hours: gathering, storing and aligning FASTA files; calculating VH, VMI and HMI spectra; and calculating the JSD between pairs of spectra from different species. For each task, MIA saved tables and graphics to the local disk, where they remain easily accessible for future analysis.
Conclusions
Our tests revealed that the “informational model free” spectra may represent species signatures. Since JSD applied to horizontal mutual information spectra resulted in statistically significant distances between species, we could calculate the respective hierarchical clusters, herein called Informational Dendrograms (ID). When compared with phylogenetic trees, all Informational Dendrograms presented similar taxonomy and species clustering.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-015-0837-0) contains supplementary material, which is available to authorized users.
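As an illustration of the quantities named in the abstract, the following minimal sketch computes a per-column entropy spectrum and mutual information values on a toy alignment. It is not MIA's implementation. It assumes that "vertical" quantities are computed over alignment columns and that "horizontal" mutual information relates symbols separated by a lag k along a single sequence; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Mutual information (bits) between two equally long symbol series."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def horizontal_mi(seq, k):
    """Assumed definition: MI between seq[i] and seq[i + k] over all valid i (k >= 1)."""
    return mutual_information(seq[:-k], seq[k:])


# Hypothetical toy alignment: four aligned sequences of length four.
alignment = ["ACGT", "ACGA", "TCGT", "TGGA"]
columns = list(zip(*alignment))

# Assumed "vertical entropy" spectrum: one entropy value per column.
vh = [column_entropy(col) for col in columns]

# Assumed "vertical mutual information" between a pair of columns.
vmi_0_1 = mutual_information(columns[0], columns[1])
```

In this toy alignment the third column is fully conserved, so its entropy is zero, while the first and fourth columns vary independently of each other and share no mutual information.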
Highlights
Short- and long-range correlations in biological sequences are central to genomic studies of covariation.
Via Jensen-Shannon Divergence, we calculated distances between entropy and mutual information spectra, yielding distance matrices that indicate whether species can be discriminated.
We tested our algorithms by searching the NCBI nucleotide database for Organism = “Drosophila” and Gene = “Adh” (Alcohol Dehydrogenase), which returned 959 sequences from 291 species.
Summary
Short- and long-range correlations in biological sequences are central to genomic studies of covariation. Most genes are highly conserved, especially in regions such as promoters, TATA boxes, exons and splice junctions. Even these conserved regions, and others such as introns, can contain many single polymorphic regions as well as pairs of separated regions whose mutations are coordinated so as to conserve or improve a given phenotype [1,2,3,4]. We searched for polymorphic regions, seeking covariation signals in DNA sequences from the Drosophila Adh locus. With these sequences we calculated entropy and mutual information spectra for different closely related species. Thereafter, via Jensen-Shannon Divergence, we calculated distances between these spectra, yielding distance matrices that indicate whether species can be discriminated.
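The distance step described above can be sketched as follows. This is a minimal illustration rather than the authors' code. It assumes each spectrum is a list of non-negative values of equal length that is normalized to sum to one before the Jensen-Shannon Divergence is taken; the species labels and spectrum values are hypothetical.

```python
import math


def normalize(spectrum):
    """Scale a non-negative spectrum so that it sums to 1 (assumed preprocessing)."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have a positive sum")
    return [v / total for v in spectrum]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon Divergence in bits between two probability vectors."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def distance_matrix(spectra):
    """Pairwise JSD between named spectra, returned as a nested dict."""
    normed = {name: normalize(s) for name, s in spectra.items()}
    return {a: {b: jsd(normed[a], normed[b]) for b in normed} for a in normed}


# Hypothetical spectra for three species (illustrative values only).
spectra = {
    "species_A": [0.9, 0.1, 0.4, 0.2],
    "species_B": [0.8, 0.2, 0.4, 0.2],
    "species_C": [0.1, 0.9, 0.1, 0.6],
}
distances = distance_matrix(spectra)
```

With base-2 logarithms the divergence is bounded between 0 and 1, which keeps distances comparable across species pairs. A matrix of this kind is the sort of input a hierarchical clustering routine would take to build a dendrogram.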