N-gram analysis of 970 microbial organisms reveals presence of biological language models

Hatice Ulku Osmanbeyoglu, Madhavi K Ganapathiraju

doi:10.1186/1471-2105-12-12

Abstract

Background: It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as "signature-style" word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of "biological language modeling", statistical n-gram analysis has been applied to the comparative analysis of whole proteome sequences of 44 organisms. That work showed that a few particular amino acid n-grams are abundant in one organism but occur very rarely in other organisms, thereby serving as genome signatures. At that time the proteomes of only 44 organisms were available, which limited the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree.

Results: We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. Taking the statistical n-gram model of one organism as reference and computing the cross-perplexity of all other microbial proteomes against it, cross-perplexity was found to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model built from the proteome of Shigella flexneri 2a, which belongs to the Gammaproteobacteria class, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms ranged from 15.59 to 29.5 and was proportional to their branching distance from S. flexneri in the evolutionary tree. The organisms of this genus, which happen to be pathotypes of E. coli, also have the perplexity values closest to those of E. coli.

Conclusion: Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences that are abundant in one organism but occur very rarely in other organisms, thereby serving as proteome signatures. Further, it has been shown that perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.
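The self-perplexity and cross-perplexity measurements described in the abstract can be illustrated with a minimal sketch. It assumes an n-gram model over the 20 standard amino acids with add-one smoothing; the smoothing scheme, the function names `ngram_counts` and `perplexity`, and the toy sequences are all assumptions made for illustration and are not taken from the Biological Language Modeling Toolkit.

```python
import math
from collections import Counter

def ngram_counts(sequences, n):
    """Count n-grams and their (n-1)-residue contexts across protein sequences."""
    grams, contexts = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            grams[gram] += 1
            contexts[gram[:-1]] += 1
    return grams, contexts

def perplexity(model, sequences, n, alphabet_size=20):
    """Perplexity of `sequences` under an add-one-smoothed n-gram model.

    `model` is the (grams, contexts) pair returned by ngram_counts.
    Add-one smoothing is an assumption of this sketch, chosen only so
    that unseen n-grams receive a non-zero probability.
    """
    grams, contexts = model
    log_prob, total = 0.0, 0
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            p = (grams[gram] + 1) / (contexts[gram[:-1]] + alphabet_size)
            log_prob += math.log2(p)
            total += 1
    if total == 0:
        raise ValueError("sequences are too short to contain any n-gram")
    return 2 ** (-log_prob / total)

# Toy "proteomes" (hypothetical data, far smaller than a real proteome).
reference = ["MKTAYIAKQRMKTAYIAKQR"]
other = ["GGSWWPLDNGGSWWPLDN"]

model = ngram_counts(reference, n=2)
self_pp = perplexity(model, reference, n=2)   # reference scored by its own model
cross_pp = perplexity(model, other, n=2)      # another organism scored by the same model
print(self_pp < cross_pp)
```

In this setup a proteome whose n-gram composition resembles the reference yields a cross-perplexity close to the self-perplexity, while a dissimilar one yields a larger value, which is the pattern the abstract reports across the phylogenetic tree.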

Highlights

It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts such as “signature-style” word usage indicative of authors or topics, and that the algorithms originally developed for natural language processing may be applied to genome sequences to draw biologically relevant conclusions.
Comparison of whole genomes/proteomes may not be feasible for large sets of organisms using multiple sequence alignment (MSA) based methods as only a small portion of genes is shared across all the organisms that are being compared.
Frequencies of corresponding unigrams in other plant pathogens are shown in thin blue lines and those in animal pathogens are shown in thin red lines.

Summary

Introduction

It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as “signature-style” word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may be applied to genome sequences to draw biologically relevant conclusions. With the rapidly increasing availability of whole genome and proteome sequences of microbes, large-scale computational recognition and comparison of patterns in biological sequences could be a first step towards discovering and understanding the biology of microbes and their diversity. Understanding this diversity is important for progress in medicine, public health and agriculture [2], and possibly for exploring alternative energy sources [3]. Comparing whole genomes/proteomes with multiple sequence alignment (MSA) based methods may not be feasible for large sets of organisms, because only a small portion of genes is shared across all the organisms being compared. The main approaches for inferring whole-genome-based phylogeny of microbial organisms are: orthologous gene comparison (e.g. [7]), which requires correct selection of orthologous genes; protein sequence/structure domain comparison (e.g. [8,9]), which requires the assignment of protein domains at the sequence/structure level; and whole genome/proteome sequence comparison, either by pair-wise alignment (e.g. [10]) or alignment-free (e.g. [11]).
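As a rough illustration of what an alignment-free, whole-proteome comparison can look like, the sketch below builds an n-gram relative-frequency profile for each organism and compares two profiles directly, with no alignment and no ortholog selection. The distance used here (half the L1 distance between profiles) is an assumption chosen for simplicity and is not the measure used in the cited works; the function names and toy sequences are likewise hypothetical.

```python
from collections import Counter

def relative_frequencies(sequences, n):
    """Relative frequency of each amino acid n-gram over a set of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def profile_distance(freq_a, freq_b):
    """Half the L1 distance between two n-gram profiles.

    Returns 0.0 for identical profiles and 1.0 for profiles that share
    no n-gram. This particular distance is an illustrative assumption.
    """
    grams = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(g, 0.0) - freq_b.get(g, 0.0)) for g in grams)

# Hypothetical toy proteomes; real input would be all proteins of an organism.
organism_a = ["MKTAYIAKQR", "MKTLLVAG"]
organism_b = ["MKTAYIAKQR", "GGSWWPLDN"]

unigrams_a = relative_frequencies(organism_a, n=1)
unigrams_b = relative_frequencies(organism_b, n=1)
print(profile_distance(unigrams_a, unigrams_b))
```

Because each profile depends only on n-gram counts, the comparison scales to large sets of organisms regardless of how few genes they share.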



Journal: BMC Bioinformatics	Publication Date: Jan 10, 2011
Citations: 84	License type: CC BY 2.0


