PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins.

Yanju Zhang, Tatsuya Akutsu, Jiahui Li, André Leier, A Ian Smith, Jiawei Wang, Tatiana T Marquez-Lago, Ruopeng Xie, Zongyuan Ge, Sha Yu, Jiangning Song, Trevor Lithgow

doi:10.1093/bioinformatics/btz629

Abstract

Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernible signal peptide sequences and can make use of diverse secretion pathways. Several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data. In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, served as the learning algorithm, and its parameters were optimized with a particle swarm optimization strategy. All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. The final ensemble model achieved an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve value of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on this optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' needs. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors. PeNGaRoo is available at http://pengaroo.erc.monash.edu/. Supplementary data are available at Bioinformatics online.
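
The four evaluation measures reported in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. It assumes that "F-value" means the F1 score (the harmonic mean of precision and recall) and that the area under the curve refers to the ROC curve. The function names `binary_metrics` and `roc_auc`, and all counts and scores in the usage note, are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F1 and Matthews correlation coefficient from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Assumption: "F-value" is taken to be F1, the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn - fp * fn) / denom) if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))
```

For example, `binary_metrics(tp=45, fp=5, tn=40, fn=10)` describes a hypothetical test set of 100 proteins, and `roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])` scores four hypothetical predictions. The pairwise AUC is quadratic in the number of examples, which is acceptable for a small sketch but would normally be replaced by a rank-based computation on larger datasets.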

Journal: Bioinformatics
Publication Date: Aug 8, 2019
Citations: 39
