Genomic Taxonomy Boost by Lexical Clustering

Kosi Gramatikoff

doi:10.15406/jig.2014.01.00004

Abstract

In the post-genomic era, drawing inferences from multiple massive data sets is a ubiquitous challenge in the computational life sciences. Multiple sequence alignment has played a key role in genomics (and other “omics”) as a means of summarizing and representing relationships between sequences. However, alignment-based strategies have two apparent problems: the computational expense of constructing alignments and the sensitivity of subsequent analyses to alignment uncertainties. Here we present a novel alignment-free alternative. We use frequency profiles (or n-gram vectors) for sequence comparison, a method inspired by lexical statistics. Such profiles can be used to infer relationships between texts or between biological sequences, and we demonstrate that two statistical techniques, hierarchical clustering (HC) and non-negative matrix factorization (NMF), provide valuable insights in both contexts. We present four case studies. First, we show that bigram frequency profiles can be used to reconstruct the ontology of 102,402 PubMed titles selected for their relevance to nine drugs and nine therapeutic proteins. Second, we apply the same methodology to classify 63 protein kinase coding DNA sequences into functional categories, based on trigram frequency profiles. The two major classes (Tyr vs Ser/Thr) are correctly identified. Third, we show that Alu subfamilies can be identified in 58,122 Alu sequences, in perfect agreement with the accepted topology of the Alu phylogeny, again based only on trigram frequency profiles. Fourth, we cluster 8,885 human promoters using trigram frequency profiles for ab initio discovery of co-expression networks associated with disease. We demonstrate that “lexical” statistics offers a viable alignment-free approach to identifying and representing structural, functional and evolutionary relationships. We envision that our approach will be applicable to rapid and revealing comparison of whole individual genomes, and will be an important tool for analysis and correlation of “omics” data.
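
The core idea of the abstract, comparing sequences through n-gram frequency profiles and then grouping them by hierarchical clustering, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy sequences, the cosine distance, the average-linkage rule, and the function names `ngram_profile`, `cosine_distance` and `average_linkage` are all assumptions chosen for illustration, and the NMF step mentioned in the abstract is not shown.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, ambiguous bases ignored


def ngram_profile(seq, n=3):
    """Relative frequency of each overlapping n-gram, in a fixed key order."""
    seq = seq.upper()
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def average_linkage(profiles, k):
    """Agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(profiles))]

    def dist(a, b):
        # Average pairwise distance between members of clusters a and b.
        pairs = [(i, j) for i in a for j in b]
        return sum(cosine_distance(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        _, a, b = min(
            (dist(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: two AT-rich and two GC-rich sequences.
sequences = [
    "ATATATATTATAATAT",
    "TATATAATATATTATA",
    "GCGCGGCGCCGCGGCG",
    "CGCGCCGCGGCGCGCC",
]
profiles = [ngram_profile(s, n=3) for s in sequences]
clusters = average_linkage(profiles, k=2)
print(clusters)
```

On this toy input the AT-rich and GC-rich sequences share no trigrams, so they end up in separate clusters without any alignment step. For data sets of the size described in the abstract, a naive pairwise loop like this one would be too slow, and a dedicated numerical library would be the more realistic choice.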

Journal: Journal of Investigative Genomics
Publication Date: May 21, 2014
Citations: 2
License type: CC BY-NC
