Structural Domain Based Multiple Instance Learning for Predicting Gram-Positive Bacterial Protein Subcellular Localization

S.Y. Mei, Wang Fei

doi:10.1109/ijcbs.2009.14

Abstract

Until recently, very few studies have been reported on predicting the subcellular location of Gram-positive bacterial proteins. Novel computational methods are much needed to help biologists design experiments. In this paper, we propose a novel machine learning model for predicting Gram-positive protein subcellular localization, as an alternative to the existing model Gpos-PLoc when the GO annotation information it requires is unavailable. The model uses protein structural domains as indicators of protein subcellular location. To capture both local sequence information and structural domain boundary partition information, we propose a novel method called Multiple Instance Multiclass Learning (MIMC), in which each domain is treated as an instance of a protein and each protein as a bag of domains. Because some proteins may have multiple subcellular locations, we introduce a related model called Multiple Instance Multiple Label Learning (MIML) to predict potential minor subcellular locations. Protein sequences and domains are encoded using the simple 20-D Amino Acid Composition (AA), so that feature dimensionality is greatly reduced and the instance representation captures domain boundary partition information, which a flat domain vector representation does not. Experiments show that the simple AA representation outperforms the order-based Pseudo Amino Acid (PseAA) representation, and that the MIMC model performs comparably to Chou’s OET-NN ensemble (Gpos-PLoc), which is, to the best of our knowledge, the only machine learning model for Gram-positive protein subcellular location prediction thus far.
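The bag-of-domains encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `aa_composition` and `protein_to_bag`, the toy sequence, and the domain boundaries are hypothetical, and the boundaries are assumed to come from some external domain annotation. Each domain segment is mapped to a 20-D amino acid composition vector (one instance), and the protein becomes the list of those vectors (the bag).

```python
from typing import List, Tuple

# The 20 standard amino acids, in a fixed (alphabetical one-letter) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(segment: str) -> List[float]:
    """Return the 20-D amino acid composition (relative frequencies) of a segment.

    Non-standard residue symbols are ignored; an empty segment gives all zeros.
    """
    counts = [segment.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def protein_to_bag(sequence: str, domain_bounds: List[Tuple[int, int]]) -> List[List[float]]:
    """Encode a protein as a bag of instances, one 20-D vector per domain.

    domain_bounds holds hypothetical (start, end) indices, end exclusive,
    assumed to be supplied by a domain annotation step not shown here.
    """
    return [aa_composition(sequence[start:end]) for start, end in domain_bounds]


# Toy example: a made-up sequence split into two made-up domains.
toy_sequence = "AAAACCCC"
toy_bag = protein_to_bag(toy_sequence, [(0, 4), (4, 8)])
```

Under these assumptions, each protein yields a variable-length list of fixed-length vectors, which is the input shape a multiple instance learner expects; a flat representation would instead compute a single composition vector over the whole sequence and lose the domain boundaries.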
