Abstract

Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via the type IV secretion system (T4SS) and cause diseases. However, experimental approaches to identify T4SEs are time- and resource-consuming, and the existing computational tools based on machine learning techniques have limitations, such as a lack of interpretability in the prediction models. In this study, we proposed a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm for accurate identification of type IV effectors using an optimal set of features derived from protein sequences. After testing 20 different types of features, we found that, under 5-fold cross-validation, the best performance was achieved when all features were fed into XGBoost, which outperformed the other machine learning methods. Then, the ReliefF algorithm was adopted to select the optimal feature set on our dataset, which further improved the model performance. T4SE-XGB exhibited the highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of features to model predictions. The identification of key features can improve understanding of the multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. In addition to type IV effector prediction, we believe that the proposed framework can provide instructive guidance for similar studies that construct prediction methods for related biological problems. The data and source code of this study can be freely accessed at https://github.com/CT001002/T4SE-XGB.
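To make the feature-selection step more concrete, the following is a minimal sketch of the basic Relief weighting scheme for binary labels, written in plain Python. It is a simplification and not the authors' implementation: ReliefF, the variant named in the abstract, extends this idea by averaging over k nearest hits and misses. The function names `relief_weights` and `top_k_features` are hypothetical, and the sketch assumes features are already scaled to [0, 1] and that each class has at least two samples.

```python
import random


def relief_weights(X, y, n_iter=None, seed=0):
    """Basic Relief feature weights for binary labels (illustrative only).

    Assumes every feature is scaled to [0, 1] and each class has at
    least two samples. A feature gains weight when it differs on the
    nearest other-class sample (miss) and loses weight when it differs
    on the nearest same-class sample (hit).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    n_iter = n_iter or n
    weights = [0.0] * d

    def manhattan(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(n)
        # Nearest neighbour with the same label, excluding the sample itself.
        hit = min(
            (j for j in range(n) if j != i and y[j] == y[i]),
            key=lambda j: manhattan(X[i], X[j]),
        )
        # Nearest neighbour with the other label.
        miss = min(
            (j for j in range(n) if y[j] != y[i]),
            key=lambda j: manhattan(X[i], X[j]),
        )
        for f in range(d):
            diff_miss = abs(X[i][f] - X[miss][f])
            diff_hit = abs(X[i][f] - X[hit][f])
            weights[f] += (diff_miss - diff_hit) / n_iter
    return weights


def top_k_features(weights, k):
    """Indices of the k highest-weighted features, best first."""
    ranked = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
    return ranked[:k]
```

In a pipeline like the one described, the selected feature indices would then be used to subset the feature matrix before training the classifier; how many features to keep is a tuning choice that the abstract does not specify.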

Highlights

  • Different secretion systems have been found in bacteria that secrete proteins into the extracellular environment

  • Several features, such as genomic organization, evolutionary based attributes, regulatory network attributes, and attributes specific to the L. pneumophila pathogenesis system, were applied as input to different machine learning algorithms: naïve Bayes, Bayesian networks, support vector machine (SVM), neural network, and a voting classifier based on these four algorithms

  • This observation indicates that features based on position-specific scoring matrix (PSSM) composition perform better in the prediction of T4SEs than other types of features
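As an illustration of the PSSM-based features mentioned in the last highlight, here is a small sketch of one common definition of PSSM composition: rows of the L×20 PSSM are summed according to the residue at each position and divided by the sequence length, which gives a fixed-length 400-dimensional vector. The text above does not spell out which variant was used, so this definition, the function name `pssm_composition`, and the column order in `AMINO_ACIDS` are all assumptions.

```python
# Assumed column order of the 20 standard amino acids in the PSSM.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_composition(sequence, pssm):
    """Return a 400-dimensional PSSM composition vector (illustrative).

    `pssm` is a list of L rows, one per residue of `sequence`, each
    holding 20 scores. Rows are accumulated per residue type and the
    resulting 20x20 matrix is divided by L and flattened row by row.
    """
    if not sequence or len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must be non-empty and equal in length")

    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    sums = [[0.0] * 20 for _ in range(20)]

    for residue, row in zip(sequence, pssm):
        r = index.get(residue)
        if r is None:
            continue  # skip non-standard residues such as X or B
        for c, score in enumerate(row):
            sums[r][c] += score

    length = len(sequence)
    return [value / length for row in sums for value in row]
```

Because the output length is always 400 regardless of protein length, vectors like this can be concatenated with other sequence-derived features and passed directly to a tabular classifier.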


Summary

Introduction

Different secretion systems have been found in bacteria that secrete proteins into the extracellular environment. A large number of computational methods have been developed for the prediction of T4SEs in the last decade, which have made the identification process faster and more efficient. These computational approaches can be categorized into two main groups: the first group infers new effectors based on sequence similarity with currently known effectors (Chen et al, 2010; Lockwood et al, 2011; Marchesini et al, 2011; Meyer et al, 2013; Sankarasubramanian et al, 2016; Noroy et al, 2019) or phylogenetic profiling analysis (Zalguizuri et al, 2019), and the second group learns the patterns of known secreted effectors that distinguish them from nonsecreted proteins, using machine learning and deep learning techniques (Burstein et al, 2009; Lifshitz et al, 2013; Zou et al, 2013; Wang et al, 2014; Ashari et al, 2017; Wang Y. et al, 2017; Esna Ashari et al, 2018, 2019a,b; Guo et al, 2018; Xiong et al, 2018; Xue et al, 2018; Acici et al, 2019; Chao et al, 2019; Hong et al, 2019; Wang J. et al, 2019; Li J. et al, 2020; Yan et al, 2020). This tool yielded reasonable candidate effector predictions for most T4SS bacteria from the Alphaproteobacteria and Gammaproteobacteria classes.

