Abstract
In the post-genomic era, drawing inferences from multiple massive data sets is a ubiquitous challenge in the computational life sciences. Multiple sequence alignment has played a key role in genomics (and other “omics”) as a means of summarizing and representing relationships between sequences. However, alignment-based strategies have two apparent problems: the computational expense of constructing alignments and the sensitivity of subsequent analyses to alignment uncertainties. Here we present a novel alignment-free alternative. We use frequency profiles (or n-gram vectors) for sequence comparison, a method inspired by lexical statistics. Such profiles can be used to infer relationships between texts or between biological sequences, and we demonstrate that two statistical techniques, hierarchical clustering (HC) and non-negative matrix factorization (NMF), provide invaluable insights in both contexts. We present four case studies. First, we show that bigram frequency profiles can be used to reconstruct the ontology of 102,402 PubMed titles selected for their relevance to nine drugs and nine therapeutic proteins. Second, we apply the same methodology to classify 63 protein kinase coding DNA sequences into functional categories, based on trigram frequency profiles. The two major classes (Tyr vs. Ser/Thr kinases) are correctly identified. Third, we show that Alu subfamilies can be identified in 58,122 Alu sequences, in perfect agreement with the accepted topology of the Alu phylogeny, again based only on trigram frequency profiles. Fourth, we cluster 8,885 human promoters using trigram frequency profiles for ab initio discovery of co-expression networks associated with disease. We demonstrate that “lexical” statistics offers a viable alignment-free approach to identifying and representing structural, functional and evolutionary relationships.
We envision that our approach will be applicable to rapid and revealing comparison of whole individual genomes, and will be an important tool for analysis and correlation of “omics” data.
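To make the idea of a trigram frequency profile concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes overlapping n-grams over the DNA alphabet, cosine distance between profiles, and a naive average-linkage agglomerative clustering; the abstract specifies none of these choices, and the function names (`ngram_profile`, `cosine_distance`, `average_linkage`) and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: DNA sequences; n-grams with other symbols are ignored


def ngram_profile(seq, n=3, alphabet=ALPHABET):
    """Relative frequency of every possible n-gram, in a fixed order."""
    seq = seq.upper()
    # Count overlapping n-grams (assumption: a sliding window with step 1).
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def average_linkage(profiles, k):
    """Naive agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(profiles))]

    def dist(a, b):
        # Mean pairwise distance between members of two clusters.
        total = sum(cosine_distance(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = [(dist(clusters[x], clusters[y]), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        _, x, y = min(pairs)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return [sorted(c) for c in clusters]


# Toy sequences (made up for illustration): two AT-rich, two GC-rich.
seqs = ["ATATATATATAT", "TATATATATATA", "GCGCGCGCGCGC", "CGCGCGCGCGCG"]
profiles = [ngram_profile(s) for s in seqs]
clusters = average_linkage(profiles, k=2)
print(clusters)
```

Because each profile has a fixed length (64 entries for DNA trigrams) regardless of sequence length, sequences can be compared without aligning them. The naive clustering loop here is only for illustration and would be far too slow at the scale of the case studies described above.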