Translate gene sequence into gene ontology terms based on statistical machine translation

Wang Liang, Zhao Kai Yong

doi:10.12688/f1000research.2-231.v1

Open Access

Abstract

This paper presents a novel method for predicting the functions of amino acid sequences using statistical machine translation. Translation models are trained on a “parallel corpus”: pairs of sentences that express the same content in two languages, such as the English sentence “I love apples” and its French counterpart “j’adore les pommes”. Here we regard an amino acid sequence such as “MTMDKSELVQKA” as a sentence in one language, and its functional description in Gene Ontology terms, such as “0005737 0006605 0019904”, as a sentence in another language. We pair amino acid sequences with their corresponding Gene Ontology annotations to build the parallel corpus, then train a phrase-based translation model that translates from “amino acid sequence” to “protein function”. The Bilingual Evaluation Understudy (BLEU) score, a measure of the quality of machine-translated text, reaches about 0.6 for the proposed method when the order of Gene Ontology words is ignored. Although its functional predictions are not yet as accurate as those of search-based methods, the method gives the function of an amino acid sequence directly and is more efficient.
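The parallel-corpus idea above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: splitting the sequence into overlapping 3-mers is an assumption made purely for illustration (the abstract does not say how sequences are divided into "words"), and the function names `to_source_sentence`, `to_target_sentence`, and `unordered_unigram_precision` are hypothetical. The scoring function is a clipped unigram precision, which is only one ingredient of BLEU, used here to show what "neglecting the order of Gene Ontology words" could mean in practice.

```python
from collections import Counter


def to_source_sentence(sequence: str, k: int = 3) -> str:
    """Turn an amino acid sequence into a space-separated 'sentence'.

    Assumption: overlapping k-mers stand in for words. The abstract does
    not specify the segmentation, so this choice is illustrative only.
    """
    seq = sequence.strip().upper()
    if len(seq) <= k:
        return seq
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def to_target_sentence(go_ids) -> str:
    """Join Gene Ontology identifiers into the target-language sentence."""
    return " ".join(go_ids)


def unordered_unigram_precision(predicted: str, reference: str) -> float:
    """Fraction of predicted GO terms found in the reference, ignoring order.

    This is clipped unigram precision, a component of BLEU rather than the
    full BLEU score reported in the abstract.
    """
    pred = Counter(predicted.split())
    ref = Counter(reference.split())
    total = sum(pred.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[term]) for term, count in pred.items())
    return matched / total


# One sentence pair of the parallel corpus, using the abstract's example.
source = to_source_sentence("MTMDKSELVQKA")
target = to_target_sentence(["0005737", "0006605", "0019904"])

# A hypothetical prediction: two correct terms in a different order, one wrong.
score = unordered_unigram_precision("0019904 0005737 0000001", target)
```

Each `source`/`target` pair would form one line pair in the two aligned corpus files that a phrase-based translation toolkit expects as training input.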

Highlights

Determining the functions of proteins is a central problem in biology.
All of the corpora can be downloaded from www.geneontology.org (Gene Ontology data) and www.uniprot.org. We provide the Gene Ontology and amino acid sequence data in the ‘data’ directory of the supplementary material accompanying this paper.
Based on statistical machine translation technology, we present a novel method to predict the function of amino acid sequences.

Summary

Introduction

Determining the functions of proteins is a central problem in biology. Nih.gov/refseq/) that store amino acid sequences and their corresponding functions. Almost all protein function prediction methods rely on the identification, characterization, or quantification of sequence similarities between the available proteins of interest [1]. However, even similar sequences do not necessarily have identical functions. A sequence may be similar to many other sequences, so it can be difficult to choose the most appropriate match. And when no similar sequence exists in any available database, function cannot be deduced at all.

Journal: F1000Research
Publication Date: Nov 1, 2013
License type: CC BY 3.0
