Abstract
This paper presents a novel method for predicting the functions of amino acid sequences, based on statistical machine translation. To build the translation model, we use the “parallel corpus” concept. For instance, the English sentence “I love apples” and its corresponding French sentence “j’adore les pommes” form one entry of a parallel corpus. Here we regard an amino acid sequence such as “MTMDKSELVQKA” as a sentence in one language, and treat its functional description, written in Gene Ontology terms such as “0005737 0006605 0019904”, as a sentence in another language. We select amino acid sequences and their corresponding functional descriptions in Gene Ontology terms to build the parallel corpus. We then use a phrase-based translation model to build the “amino acid sequence” to “protein function” translation model. The Bilingual Evaluation Understudy (BLEU) score, an algorithm for measuring the quality of machine-translated text, reaches about 0.6 for the proposed method when the order of Gene Ontology terms is neglected. Although its functional prediction is still less accurate than that of search-based methods, the method gives the function of an amino acid sequence directly and is more efficient.
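The parallel-corpus idea and the order-insensitive scoring described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The helper names (`tokenize_sequence`, `make_parallel_pair`, `unordered_unigram_score`) are hypothetical, the one-residue-per-token split is an assumption, and the score is a simplified unigram stand-in for BLEU (clipped unigram precision with a brevity penalty), chosen only because unigram matching ignores word order.

```python
from collections import Counter
import math


def tokenize_sequence(seq: str) -> str:
    """Split an amino acid sequence into space-separated residues.

    Assumption: one residue per token; the paper may segment differently.
    """
    return " ".join(seq.strip().upper())


def make_parallel_pair(seq: str, go_terms: list[str]) -> tuple[str, str]:
    """Build one (source sentence, target sentence) entry of a parallel corpus."""
    return tokenize_sequence(seq), " ".join(go_terms)


def unordered_unigram_score(candidate: list[str], reference: list[str]) -> float:
    """Clipped unigram precision times a brevity penalty.

    A simplified, order-insensitive stand-in for BLEU, not the full metric.
    """
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Counter intersection clips each term's count to its count in the reference.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / len(candidate)
    # Penalize candidates that are shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * precision


source, target = make_parallel_pair("MTMDKSELVQKA", ["0005737", "0006605", "0019904"])
predicted = ["0019904", "0005737", "0006605"]  # same terms, different order
score = unordered_unigram_score(predicted, target.split())
```

Under these assumptions, a prediction that contains the reference terms in a different order still receives the maximum score, which is the sense in which term order is neglected.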
Highlights
Determining the functions of proteins is a central problem in biology
All of the corpora can be downloaded from www.geneontology.org (Gene Ontology data) and www.uniprot.org. We provide these Gene Ontology and amino acid sequence data in the ‘data’ directory of the supplementary material accompanying this paper
Based on statistical machine translation technology, we present a novel method to predict the function of amino acid sequences
Summary
Determining the functions of proteins is a central problem in biology. Nih.gov/refseq/) that store amino acid sequences and their corresponding functions. Almost all protein function prediction methods rely on the identification, characterization, or quantification of sequence similarities between the available proteins of interest[1]. However, even similar sequences do not necessarily have identical functions. A sequence may also be similar to many other sequences, which makes it difficult to choose the most appropriate match. And if no similar sequence exists in any available database, there is no way to deduce function.