Abstract

Accurate computational prediction of essential proteins can effectively reduce the cost of wet-lab experiments. Existing computational methods usually rely on protein-protein interaction (PPI) networks, often combined with other kinds of biological data. However, high-quality PPI networks and other biological data are not available for all proteins. It is therefore valuable to develop accurate and fast methods that predict essential proteins from protein sequences alone. We propose EP-GBDT, a machine learning ensemble model that improves the performance of essential protein prediction while using only protein sequences. EP-GBDT has an ensemble structure that combines multiple Gradient Boosting Decision Tree (GBDT) base classifiers. In addition, EP-GBDT uses a sampling technique to reduce the effects of the imbalanced dataset. The results show that EP-GBDT outperforms state-of-the-art sequence-based methods and network-based centrality measures. The source code and datasets can be downloaded from https://github.com/CSUBioGroup/EP-GBDT.
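
For illustration only, the general idea of pairing a sampling step with an ensemble of base classifiers might be sketched as below. The abstract does not state which sampling technique is used, how many base classifiers are trained, or how their outputs are combined, so this sketch assumes random undersampling of the majority class and simple score averaging. All names (`balanced_subsets`, `CentroidClassifier`, `UndersamplingEnsemble`) are hypothetical and are not taken from the EP-GBDT repository. To keep the example dependency-free, a trivial centroid scorer stands in for the GBDT base classifier; any object with `fit` and `predict_proba` methods, such as scikit-learn's `GradientBoostingClassifier`, could be supplied instead.

```python
import math
import random


def balanced_subsets(X, y, n_subsets, seed=0):
    """Yield balanced (X, y) training sets: all minority-class samples plus
    an equally sized random draw (without replacement) from the majority."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    for _ in range(n_subsets):
        idx = minority + rng.sample(majority, len(minority))
        yield [X[i] for i in idx], [y[i] for i in idx]


class CentroidClassifier:
    """Hypothetical stand-in base learner (NOT a GBDT): scores a sample by
    its relative distance to the two class centroids."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, X):
        out = []
        for x in X:
            d0 = math.dist(x, self.centroids[0])
            d1 = math.dist(x, self.centroids[1])
            # Closer to the positive centroid -> higher positive score.
            p1 = 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)
            out.append([1.0 - p1, p1])
        return out


class UndersamplingEnsemble:
    """Train one base classifier per balanced subset, then average the
    positive-class scores (an assumed combination rule)."""

    def __init__(self, make_base, n_estimators=10, seed=0):
        self.make_base = make_base        # zero-argument factory for a base learner
        self.n_estimators = n_estimators  # assumed value, not from the paper
        self.seed = seed

    def fit(self, X, y):
        self.models = [
            self.make_base().fit(Xs, ys)
            for Xs, ys in balanced_subsets(X, y, self.n_estimators, self.seed)
        ]
        return self

    def predict_proba(self, X):
        per_model = [[row[1] for row in m.predict_proba(X)] for m in self.models]
        return [sum(scores) / len(scores) for scores in zip(*per_model)]
```

In this sketch, `X` would hold numeric feature vectors derived from protein sequences and `y` the essential (1) or non-essential (0) labels; how those features are computed is outside what the abstract describes.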
