Abstract

Multi-label proteins, which occur in two or more subcellular locations, play a vital role in cell development and metabolism. Prediction and analysis of multi-label subcellular localization (SCL) can offer new perspectives on drug target identification and new drug design. However, determining multi-label protein SCL through biological experiments is expensive and labor-intensive. Therefore, large-scale SCL prediction with machine learning methods has become a popular research topic in bioinformatics. In this study, a novel multi-label learning method for protein SCL prediction, called DMLDA-LocLIFT, is proposed. First, dipeptide composition (DC), encoding based on grouped weight (EBGW), pseudo amino acid composition (PseAAC), gene ontology (GO) and pseudo-position specific scoring matrix (PsePSSM) are employed to encode protein sequences, and the resulting features are fused into a single vector. Then, direct multi-label linear discriminant analysis (DMLDA) is used to remove noise from the fused feature vector. Lastly, the resulting optimal feature vectors are input into the multi-label learning with Label-specIfic FeaTures (LIFT) classifier to predict SCL. Leave-one-out cross validation (LOOCV) shows that the overall actual accuracy on the Gram-negative bacteria, Gram-positive bacteria, plant, virus and human datasets is 98.6%, 99.6%, 97.9%, 94.7% and 96.1%, respectively, which is higher than that of other state-of-the-art prediction methods. The proposed model can effectively predict the SCL of multi-label proteins and provide a reference for experimental identification of SCL. The source code and datasets are available at https://github.com/QUST-AIBBDRC/DMLDA-LocLIFT/.
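
As an illustration of the first encoding step, the dipeptide composition (DC) of a sequence can be sketched as the normalized frequency of each of the 400 ordered pairs of the 20 standard amino acids. The code below is a minimal sketch and is not taken from the authors' repository: the names `dipeptide_composition`, `DIPEPTIDES` and `INDEX` are hypothetical, and skipping pairs that contain non-standard residues and normalizing by the number of remaining pairs is an assumption made here.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered dipeptides, with a lookup from dipeptide to vector position.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return a 400-dimensional dipeptide frequency vector for a sequence.

    Assumption (not from the paper): adjacent pairs containing a
    non-standard residue are skipped, and counts are divided by the
    number of pairs that were actually counted.
    """
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    total = 0
    for i in range(len(seq) - 1):
        idx = INDEX.get(seq[i:i + 2])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        # Too short, or no valid pairs: return an all-zero vector.
        return [0.0] * len(DIPEPTIDES)
    return [c / total for c in counts]
```

In a pipeline of the kind described above, a vector like this would be concatenated with the other encodings before dimensionality reduction; that fusion step is not shown here.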

