HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection.

Wanyue Xiao, Taigang Liu, Yang Yang, Xiuzhi Sang, Huiwen Zheng

doi:10.1155/2020/1384749

Abstract

Prediction of DNA-binding proteins (DBPs) has become a popular research topic in protein science because DBPs play a crucial role in nearly all aspects of biological activity. Although considerable effort has been devoted to developing powerful computational methods for this problem, it remains a challenging task in bioinformatics. Hidden Markov model (HMM) profiles have been shown to provide important clues for improving the prediction of DBPs. In this paper, we propose a method, called HMMPred, which extracts amino acid composition and auto- and cross-covariance transformation features from the HMM profiles to train a machine learning model for identifying DBPs. Feature selection is then performed with the extreme gradient boosting (XGBoost) algorithm. Finally, the selected optimal features are fed into a support vector machine (SVM) classifier to predict DBPs. Experimental results on two benchmark datasets show that the proposed method is superior to most existing methods and could serve as an alternative tool for identifying DBPs.

Highlights

DNA-binding proteins (DBPs), which can bind to and interact with DNA, play prominent roles in the structural composition of DNA and the regulation of genes
We propose a novel method, called HMMPred, which utilizes features extracted solely from the hidden Markov model (HMM) profile to further improve the prediction accuracy of DBPs
Features are extracted from the HMM profiles by fusing three techniques, i.e., amino acid composition (AAC), auto-covariance transformation, and cross-covariance transformation
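
The feature extraction step named in the abstract (amino acid composition plus auto- and cross-covariance transformation of an HMM profile) can be illustrated with a minimal sketch. It assumes the profile is already available as an L x 20 list of per-residue scores and uses the commonly cited covariance definitions with normalization by (L - lag); the function names `aac`, `acc`, and `extract_features`, the `max_lag` parameter, and the exact normalization are assumptions for illustration, not details confirmed by this text. The later XGBoost selection and SVM classification steps are not shown.

```python
def aac(profile):
    """Column means of an L x N profile (one value per amino acid type)."""
    n_rows = len(profile)
    n_cols = len(profile[0])
    return [sum(row[j] for row in profile) / n_rows for j in range(n_cols)]


def acc(profile, max_lag):
    """Auto- and cross-covariance features for lags 1..max_lag.

    Assumed definition: for columns j1, j2 and lag g,
    mean over i of (P[i][j1] - mean_j1) * (P[i + g][j2] - mean_j2).
    Auto-covariance uses j1 == j2, cross-covariance uses j1 != j2.
    """
    n_rows = len(profile)
    n_cols = len(profile[0])
    if max_lag >= n_rows:
        raise ValueError("max_lag must be smaller than the sequence length")
    means = aac(profile)
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in profile]
    auto, cross = [], []
    for lag in range(1, max_lag + 1):
        span = n_rows - lag  # number of residue pairs separated by this lag
        for j1 in range(n_cols):
            for j2 in range(n_cols):
                value = sum(
                    centered[i][j1] * centered[i + lag][j2] for i in range(span)
                ) / span
                (auto if j1 == j2 else cross).append(value)
    return auto + cross


def extract_features(profile, max_lag=2):
    """Concatenate AAC and ACC features into one fixed-length vector."""
    return aac(profile) + acc(profile, max_lag)


if __name__ == "__main__":
    # Toy 5-residue profile with 20 columns; the values are made up.
    toy = [[((i * 7 + j * 3) % 11) / 10 for j in range(20)] for i in range(5)]
    print(len(extract_features(toy, max_lag=2)))
```

Under these assumptions, a 20-column profile yields 20 + 400 * `max_lag` features regardless of sequence length, which is what lets proteins of different lengths share one classifier input format.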

Summary

Introduction

DNA-binding proteins (DBPs), which can bind to and interact with DNA, play prominent roles in the structural composition of DNA and the regulation of genes. These proteins have a variety of biochemical functions in cellular and molecular biology, including the participation in and regulation of various cellular processes, such as transcription, DNA replication, recombination, modification, and repair [1, 2]. DBPs have traditionally been identified by experimental techniques, such as filter binding assays, genetic analysis, X-ray crystallography, ChIP-chip analysis, and nuclear magnetic resonance (NMR) [4]. With the rapid increase in protein sequence data, there is a great need to develop efficient computational methods that identify DBPs based solely on their primary sequences.

Methods

Results

Conclusion