Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization.

Warin Wattanapornprom, Supatcha Lertampaiporn, Apiradee Hongsthong, Chinae Thammarongtham

doi:10.3390/life11040293

Abstract

The accurate prediction of protein localization is a critical step in any functional genome annotation process. This paper proposes an improved strategy for protein subcellular localization prediction in plants based on multiple classifiers, with the aim of improving both the accuracy and the reliability of the predictions. Predicting plant protein subcellular localization is challenging because the underlying problem is not only multiclass but also multilabel. Plant proteins can generally be found in 10–14 locations/compartments, and the number of proteins in some compartments (nucleus, cytoplasm, and mitochondria) is generally much greater than that in others (vacuole, peroxisome, Golgi, and cell wall), so the data are usually imbalanced. To address this, we propose an ensemble machine learning method based on average voting among heterogeneous classifiers. We first extracted various types of features suitable for each type of protein localization, giving a total of 479 features. Feature selection methods were then used to reduce these to smaller, informative feature subsets, which were used to train three different individual models. The outputs of the three classifiers were combined by average voting to produce the final probability prediction. Based on the voting probability, the method can predict both single-label and multilabel subcellular localizations. Experimental results indicated that the proposed ensemble method achieved an overall accuracy of 84.58% for 11 compartments on the testing dataset.
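The average-voting step described above can be illustrated with a minimal sketch. The three per-model probability dictionaries, the function names, the three-location subset, and the 0.5 decision threshold are all assumptions made for illustration; the abstract does not specify the classifiers or the rule used to turn averaged probabilities into multilabel calls.

```python
from typing import Dict, List


def average_vote(model_probs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the per-location probabilities returned by several classifiers."""
    locations = model_probs[0].keys()
    n_models = len(model_probs)
    return {loc: sum(p[loc] for p in model_probs) / n_models for loc in locations}


def predict_locations(avg: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every location whose averaged probability reaches the threshold.

    The threshold is a hypothetical choice. If no location reaches it, fall
    back to the single highest-scoring location so a label is always returned.
    """
    chosen = [loc for loc, p in avg.items() if p >= threshold]
    if not chosen:
        chosen = [max(avg, key=avg.get)]
    return chosen


# Hypothetical outputs of three heterogeneous classifiers for one protein.
model_a = {"nucleus": 0.9, "cytoplasm": 0.6, "mitochondrion": 0.1}
model_b = {"nucleus": 0.8, "cytoplasm": 0.5, "mitochondrion": 0.2}
model_c = {"nucleus": 0.7, "cytoplasm": 0.7, "mitochondrion": 0.0}

avg = average_vote([model_a, model_b, model_c])
labels = predict_locations(avg)
print(labels)
```

With these made-up inputs, two locations pass the threshold, so the protein receives a multilabel prediction; a protein with only one location above the threshold would receive a single label.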

Highlights

Subcellular localization is one of the key properties considered in the functional annotation of proteins [1,2,3]
The training and testing datasets obtained from Plant-mSubP [11] were used to train and evaluate the performance of the program for 11 protein locations. These data had already been filtered
Several features were included in this approach to represent proteins, such as the amino acid composition, pseudo amino acid composition, annotation-based methods (GO-based features), and sorting signals
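Of the feature types listed above, amino acid composition is the simplest to illustrate: it is the fraction of each of the 20 standard amino acids in a sequence. The sketch below is only an illustration; the function name and the decision to ignore non-standard residue codes are assumptions, not details taken from the paper.

```python
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in a protein sequence.

    Non-standard characters (e.g. X, B, Z) are ignored; this is an assumption
    made for the sketch.
    """
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return {aa: residues.count(aa) / total for aa in AMINO_ACIDS}


comp = amino_acid_composition("AACD")
print(comp["A"], comp["C"], comp["D"])
```

The result is a fixed-length vector of 20 values that sum to 1, regardless of sequence length, which makes it usable as classifier input.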

Summary

Introduction

Subcellular localization is one of the key properties considered in the functional annotation of proteins [1,2,3]. Identifying the subcellular locations of proteins is immensely helpful for understanding their function, and knowledge of protein localization can provide valuable information for identifying or designing drug targets [4,5]. According to the statistical release for 2020 (Release: 2020_06 of 2 December 2020), UniProtKB contains 59,932,518 sequence entries, but only 350,510 of these proteins have a reviewed (manually annotated) subcellular localization status [6]. There is therefore a need for an accurate computational alternative that uses artificial intelligence and machine learning to provide fast and accurate localization predictions for new proteins [9].

Methods

Results

Conclusion