Abstract

Recent years have witnessed much progress in computational modelling for protein subcellular localization. However, existing sequence-based predictive models show moderate or unsatisfactory performance, and gene ontology (GO) based models risk overestimating performance on novel proteins. Furthermore, many human proteins have multiple subcellular locations, which makes the computational modelling more complicated. To date, few studies have specifically addressed predicting the subcellular localization of human proteins that may reside in multiple cellular compartments. In this paper, we propose a multi-label multi-kernel transfer learning model for human protein subcellular localization (MLMK-TLM). MLMK-TLM introduces a multi-label confusion matrix, formally formulates three multi-labelling performance measures and adapts one-against-all multi-class probabilistic outputs to the multi-label learning scenario; on this basis it extends our published work GO-TLM (gene ontology based transfer learning model for protein subcellular localization) and MK-TLM (multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization) to multiplex human protein subcellular localization. With the advantages of proper homolog knowledge transfer, a comprehensive survey of model performance on novel proteins and multi-labelling capability, MLMK-TLM gains greater practical applicability. Experiments on a human protein benchmark dataset show that MLMK-TLM significantly outperforms the baseline model and demonstrates good multi-labelling ability for novel human proteins. Some findings (predictions) are validated by the latest Swiss-Prot database. The software can be freely downloaded at http://soft.synu.edu.cn/upload/msy.rar.
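The abstract states that one-against-all multi-class probabilistic outputs are adapted to the multi-label scenario, but this excerpt does not give the exact decision rule. The sketch below illustrates one common adaptation, with a hypothetical threshold of 0.5 and made-up location names and scores (all assumptions, not the paper's actual rule):

```python
import numpy as np

def multilabel_predict(probs, labels, threshold=0.5):
    """Assign every subcellular location whose one-against-all
    probability reaches the threshold; if none does, fall back to
    the single top-scoring location so each protein still gets at
    least one predicted compartment."""
    probs = np.asarray(probs, dtype=float)
    chosen = [lab for lab, p in zip(labels, probs) if p >= threshold]
    if not chosen:
        chosen = [labels[int(probs.argmax())]]
    return chosen

# Hypothetical scores for one protein over four compartments
locations = ["Cytoplasm", "Nucleus", "Membrane", "Mitochondrion"]
print(multilabel_predict([0.81, 0.66, 0.12, 0.04], locations))
# -> ['Cytoplasm', 'Nucleus']
```

The fallback to the top-scoring class guarantees non-empty predictions, which is needed before locative performance measures can be computed.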

Highlights

  • Recent years have witnessed much progress in computational modelling for protein subcellular localization [1]

  • A high sequence-identity threshold, e.g. 60%, cannot guarantee that no noise is introduced to the target protein; conversely, remote homologs may converge with the target protein in terms of subcellular localization. For instance, the target protein P21291 resides in the subcellular locations Endoplasmic reticulum, Membrane and Microsome, while its first 7 significant remote homologs queried against the SwissProt 57.3 database [13] with default BLAST options, O75881 (26.82%, 4e-041), O02766 (25.05%, 4e-028), Q63688 (25.66%, 2e-027), P22680 (23.68%, 4e-026), Q16850 (23.92%, 4e-025), O88962 (25.05%, 4e-025) and Q64505 (23.13%, 1e-024), reside in the same subcellular locations: Endoplasmic reticulum, Membrane and Microsome

  • A protein with multiple subcellular locations should be treated as one training example for each subcellular location it belongs to; that is, the same protein is viewed as a different protein within each of its subcellular locations, referred to as a locative protein in the literature [4,5,15,19,27,28,29,30]
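The locative-protein convention described above amounts to a simple dataset expansion, sketched below (a minimal illustration; the placeholder sequence and tuple layout are assumptions, not the paper's data format):

```python
def expand_locative(dataset):
    """Expand each multi-location protein into 'locative' examples:
    a protein annotated with k subcellular locations becomes k
    single-label training examples, one per location."""
    return [(pid, seq, loc)
            for pid, seq, locs in dataset
            for loc in locs]

# P21291 (three compartments) yields three locative examples
toy = [("P21291", "SEQUENCE",
        ["Endoplasmic reticulum", "Membrane", "Microsome"])]
print(expand_locative(toy))
```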

Introduction

Recent years have witnessed much progress in computational modelling for protein subcellular localization [1]. Based on Hum-mPLoc, Shen HB et al. [5] further proposed Hum-mPLoc 2.0 for multiplex and novel human protein subcellular localization, where a more stringent human dataset with a 25% sequence-similarity threshold was constructed to train a kNN ensemble classifier. Mei S [26] further proposed an improved transfer learning model (MK-TLM), which improved on GO-TLM in two major respects: (1) more rational noise control over divergent homolog knowledge transfer; (2) a comprehensive survey of model performance, especially for novel protein prediction. Neither GO-TLM nor MK-TLM is applicable to multiplex protein subcellular localization prediction.
