Gene ontology based transfer learning for protein subcellular localization

Suyu Mei, Shuigeng Zhou, Wang Fei

doi:10.1186/1471-2105-12-44

Open Access

https://doi.org/10.1186/1471-2105-12-44

Journal: BMC Bioinformatics
Publication Date: Feb 2, 2011
Citations: 128
License type: CC BY 2.0

Affiliation: Shenyang Normal University, Fudan University

Abstract

Background

Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of the data may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources and exploit multi-aspect protein feature information. Gene ontology (GO) uses a controlled vocabulary to describe biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, GO has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenate the GO terms into a flat binary vector or apply majority-vote based ensemble learning for protein subcellular localization; neither approach can estimate the individual discriminative abilities of the three aspects of gene ontology.

Results

In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of potentially false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two further spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, so that the time and space complexities are greatly reduced compared with the more complicated semi-definite programming and semi-infinite linear programming approaches. The five kernels are then linearly merged into a single kernel for protein subcellular localization.

We evaluate GO-TLM against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets that the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves a substantial accuracy improvement over the baseline models (differences are in percentage points):

Euk-mPLoc dataset: 80.38% against Euk-mPLoc 67.40%, an increase of 12.98
MultiLoc plant: 96.65% against MultiLoc-GO 89.60%, an increase of 7.05
MultiLoc animal: 96.27% against MultiLoc-GO 89.60%, an increase of 6.67
BaCelLoc plant: 97.14% against MultiLoc-GO 83.70%, an increase of 13.44
BaCelLoc fungi: 95.90% against MultiLoc-GO 90.10%, an increase of 5.80
BaCelLoc animal: 96.85% against MultiLoc-GO 85.70%, an increase of 11.15

On the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on the BaCelLoc plant holdout, BaCelLoc fungi holdout and BaCelLoc animal holdout datasets, respectively, compared with 76.00%, 60.00% and 73.00% for the baseline model MultiLoc-GO, an increase of 5.25, 20.45 and 6.46, respectively.

Conclusions

Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (the Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins. This can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test results show that homology-based GO term transfer and explicit weighing of the GO kernels substantially improve prediction performance.
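
The linear merging of weighted kernels described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting rule, so this example assumes, purely for illustration, that each kernel's weight is proportional to its own cross-validation accuracy. The function names `normalize_weights` and `combine_kernels`, the toy kernel matrices and the accuracy values are all hypothetical and are not taken from the paper.

```python
# Minimal sketch of explicitly weighted linear kernel combination.
# ASSUMPTION: weights are proportional to per-kernel cross-validation
# accuracy; the paper's actual weighting rule may differ.

def normalize_weights(cv_accuracies):
    """Scale per-kernel cross-validation accuracies so they sum to 1."""
    total = sum(cv_accuracies)
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")
    return [acc / total for acc in cv_accuracies]


def combine_kernels(kernels, weights):
    """Return the weighted sum of equally sized square kernel matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, weight in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * kernel[i][j]
    return combined


# Toy data: two 2x2 kernel matrices over the same two proteins
# (hypothetical values; GO-TLM itself merges five kernels).
toy_kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
toy_weights = normalize_weights([0.6, 0.2])  # made-up CV accuracies
toy_combined = combine_kernels(toy_kernels, toy_weights)
```

A non-negative weighted sum of valid (positive semi-definite) kernels is itself a valid kernel, so the merged matrix can in principle be passed to any kernel classifier that accepts a precomputed kernel.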

Highlights

Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of the data may not tell the true story.
PROSITE uses regular expressions to represent significant amino acid patterns or uses profiles to detect structural and functional domains; PRINTS collects protein family fingerprints; PFAM is a database of protein domain families that contains curated multiple sequence alignments for each family and the corresponding profile hidden Markov models (HMMs); ProDom provides automatic domain queries based on recursive use of PSI-BLAST homology search; SMART collects domains that are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues; TIGRFAMs is a collection of protein families characterized by curated multiple sequence alignments, hidden Markov models (HMMs) and associated information supporting functional identification of proteins by sequence homology.
The first dataset, MultiLoc, collects 5859 proteins that are unevenly distributed over 10 subcellular locations for Plant data and 9 subcellular locations for Fungi data and Animal data [58]; the second dataset, BaCelLoc, originally from the work [76], collects 491 proteins for Plant, 1198 proteins for Fungi and 2597 proteins for Animal that are unevenly located in 5 subcellular locations for Plant and 4 subcellular locations for Fungi and Animal [58,77]; the third dataset, Euk-mPLoc, collects 5618 proteins that are unevenly located in 22 subcellular locations, the largest dataset so far in terms of number of subcellular locations [50].

Summary

Introduction

Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of the data may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources and exploit multi-aspect protein feature information. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenate the GO terms into a flat binary vector or apply majority-vote based ensemble learning for protein subcellular localization; neither approach can estimate the individual discriminative abilities of the three aspects of gene ontology. Data integration has become a popular way to combine diverse biological data, including non-sequence information such as GO annotation, protein-protein interaction networks, protein structural information and cell image features.
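
The contrast between a flat binary GO vector and separate per-aspect similarities can be made concrete with a small sketch. It assumes, for illustration only, that a GO kernel is the linear kernel on binary term-presence vectors, which equals the number of shared GO terms; the paper's actual GO kernels may be defined differently. The two proteins and their annotations are hypothetical toy data, and `go_kernel`, `per_aspect_kernels` and `flat_kernel` are names introduced here.

```python
# Sketch: per-aspect GO similarity versus one flat concatenated vector.
# ASSUMPTION: linear kernel on binary GO-term vectors (count of shared
# terms); this is an illustrative choice, not the paper's definition.

ASPECTS = ("biological_process", "molecular_function", "cellular_component")


def go_kernel(terms_a, terms_b):
    """Linear kernel on binary term vectors = number of shared GO terms."""
    return len(terms_a & terms_b)


def per_aspect_kernels(protein_a, protein_b):
    """One similarity value per GO aspect, so each can be weighted separately."""
    return {aspect: go_kernel(protein_a[aspect], protein_b[aspect])
            for aspect in ASPECTS}


def flat_kernel(protein_a, protein_b):
    """Similarity on the concatenated vector: aspects are summed with equal weight."""
    return sum(per_aspect_kernels(protein_a, protein_b).values())


# Hypothetical toy annotations for two proteins.
protein_a = {
    "biological_process": {"GO:0006355"},   # regulation of transcription
    "molecular_function": {"GO:0003677"},   # DNA binding
    "cellular_component": {"GO:0005634"},   # nucleus
}
protein_b = {
    "biological_process": {"GO:0006468"},                 # protein phosphorylation
    "molecular_function": {"GO:0003677", "GO:0005524"},   # DNA binding, ATP binding
    "cellular_component": {"GO:0005634", "GO:0005737"},   # nucleus, cytoplasm
}

aspect_similarity = per_aspect_kernels(protein_a, protein_b)
flat_similarity = flat_kernel(protein_a, protein_b)
```

Under these assumptions the flat similarity is just the unweighted sum of the three aspect similarities, which is why a flat vector cannot express that one aspect is more discriminative than another.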



Similar Papers

ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization
Wen-Lin Huang ... Shih-Wen Ho
BMC Bioinformatics | VOL. 9
01 Feb 2008

Linking molecular function and biological process terms in the ontology for gene expression data analysis
M Dejongh ... P Van Dort
01 Jan 2004

Analysis of expressed sequence tags from cDNA library of Fusarium culmorum infected barley (Hordeum vulgare L.) roots.
Feyza Tufan ... Filiz Gürel
Bioinformation | VOL. 11
30 Jan 2015

Defining functional distances over Gene Ontology
Angela Del Pozo ... Alfonso Valencia
BMC Bioinformatics | VOL. 9
25 Jan 2008
