Abstract

Biophysical and biochemical principles govern biological sequences (e.g., DNA, RNA, and protein sequences), much as the grammar of a natural language determines the structure of its clauses and sentences. This analogy motivates “life language processing,” that is, treating biological sequences as the output of a certain language and adopting or developing language processing methods to perform analyses and predictions in that language. In this chapter, we present two specific tasks related to life language processing.

(1) Developing language-model-based representations for biological sequences. The large gap between the number of known sequences (raw data) and the number of known functions/structures associated with these sequences (metadata) encourages us to develop methods that extract prior knowledge from existing sequences for use in bioinformatics tasks (e.g., protein structure and function prediction). Continuous vector representations of words, known as word vectors, have recently become popular in natural language processing (NLP) as an efficient unsupervised approach to representing semantic/syntactic units of text, helping downstream NLP tasks such as machine translation, part-of-speech tagging, and information retrieval. In this work, we propose distributed vector representations of biological sequence segments (n-grams, or k-mers in the bioinformatics literature), called biovectors, trained with a skip-gram neural network. We propose an intrinsic evaluation of biovectors that measures the continuity of the underlying biophysical and biochemical properties (e.g., average mass, hydrophobicity, and charge). For extrinsic evaluation, we employ this representation to classify 324,018 protein sequences belonging to 7027 protein families, obtaining an average family classification accuracy of 93% ± 0.06%. In addition, replacing one-hot features with biovector representations in a max-margin Markov network (M3Net) improves sequence labeling accuracy from 73.84% to 74.99% on intron-exon prediction and from 82.4% to 89.8% on domain identification.

(2) Performing a language-model-based comparison of genomic language variations using biovectors. The purpose is to quantify the distances between syntactic and semantic features of two genomic language variations, with various applications in comparative genomics. The training of biovectors is analogous to neural probabilistic language modeling, which makes the network of n-grams in the embedding space an indirect representation of the underlying language model. Building on this observation, we propose word embedding language divergence, a quantitative measure of distance between genomic language variations based on the divergence between their networks of n-grams. We compare the languages of the coding regions in the genomes of 12 different organisms (4 plants, 6 animals, and 2 human subjects). Our results confirm a significant high-level difference between the genetic language models of humans/animals and those of plants. The proposed method is a step toward a quantitative measure of similarity between genomic languages, with applications in the characterization and classification of sequences of interest.
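For concreteness, the sketch below shows how biovectors of the kind described in task (1) can be trained: each sequence is split into shifted lists of non-overlapping k-mers, which are then fed to a skip-gram model. It uses gensim's Word2Vec with sg=1 (skip-gram); the k-mer size, vector dimension, number of epochs, and the toy corpus are illustrative assumptions, not the settings or data used in the chapter.

    # Minimal sketch of biovector training (assumes gensim >= 4.0).
    from gensim.models import Word2Vec

    def kmer_sentences(sequence, k=3):
        """Split a sequence into k shifted lists of non-overlapping k-mers."""
        sentences = []
        for offset in range(k):
            shifted = sequence[offset:]
            sentences.append([shifted[i:i + k]
                              for i in range(0, len(shifted) - k + 1, k)])
        return sentences

    # Hypothetical toy corpus; in practice this would be a large sequence set.
    corpus = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
              "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGAR"]
    sentences = [s for seq in corpus for s in kmer_sentences(seq, k=3)]

    # sg=1 selects the skip-gram architecture; vector_size and epochs are
    # illustrative choices.
    model = Word2Vec(sentences, vector_size=100, window=5, sg=1,
                     min_count=1, epochs=10)
    vec = model.wv["MKT"]  # 100-dimensional biovector for the 3-mer "MKT"

Once trained, the per-k-mer vectors can be summed or averaged to obtain a fixed-length representation of a whole sequence, one common way to build inputs for downstream classifiers such as a protein family classifier.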
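For task (2), the sketch below illustrates the general idea of comparing two genomic language variations through their n-gram embedding networks: each trained model induces a similarity network over a shared k-mer vocabulary, and the two networks are then compared with a divergence measure. The mean Jensen-Shannon divergence over row-normalized cosine-similarity matrices used here is a simplified stand-in for the chapter's word embedding language divergence, not its exact definition.

    # Illustrative comparison of two biovector models (e.g., one per organism),
    # trained as in the previous sketch. Assumes numpy, scipy, and gensim >= 4.0.
    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def similarity_network(model, vocabulary):
        """Row-stochastic matrix of cosine similarities between k-mers."""
        vectors = np.array([model.wv[w] for w in vocabulary])
        unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        sim = np.clip(unit @ unit.T, 0.0, None)   # keep non-negative weights
        return sim / sim.sum(axis=1, keepdims=True)

    def embedding_divergence(model_a, model_b):
        """Mean Jensen-Shannon divergence between the two n-gram networks."""
        shared = sorted(set(model_a.wv.key_to_index) &
                        set(model_b.wv.key_to_index))
        net_a = similarity_network(model_a, shared)
        net_b = similarity_network(model_b, shared)
        return float(np.mean([jensenshannon(net_a[i], net_b[i]) ** 2
                              for i in range(len(shared))]))

A larger value indicates that the two variations organize their k-mers differently in embedding space, which is the intuition behind treating the divergence between n-gram networks as a distance between genomic languages.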
