Abstract

DNA-binding proteins play vital roles in cellular processes such as DNA packaging, replication, transcription, regulation, and other DNA-associated activities. Current prediction methods are mainly based on machine learning, and their accuracy depends largely on the feature extraction method. Therefore, an efficient feature representation method is important for improving classification accuracy. However, existing feature representation methods cannot efficiently distinguish DNA-binding proteins from non-DNA-binding proteins. In this paper, a multi-feature representation method, which combines three feature representation methods, namely, K-Skip-N-Grams, Information theory, and Sequential and structural features (SSF), is used to represent the protein sequences and improve feature representation ability. The classifier is a support vector machine. The mixed-feature representation method is evaluated using 10-fold cross-validation and a test set. Feature vectors obtained from a combination of all three feature extraction methods show the best performance in 10-fold cross-validation, both without dimensionality reduction and with dimensionality reduction by max-relevance-max-distance. Moreover, the reduced mixed feature method performs better than the non-reduced mixed feature technique. The feature vectors that combine SSF and K-Skip-N-Grams show the best performance on the test set. Overall, mixed features outperform single features.
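
To make the idea of a K-Skip-N-Grams representation concrete, the following is a minimal sketch for the bigram case (n = 2). It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact definition, so the function name `k_skip_bigram_features`, the normalization by total pair count, and the handling of non-standard residues are all hypothetical choices.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_skip_bigram_features(sequence, k):
    """Return a 400-dimensional vector of normalized residue-pair counts.

    A pair (a, b) is counted whenever b follows a with 0..k residues
    skipped in between. This is one common formulation of k-skip bigrams
    and is an assumption, not the definition used in the paper.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence)):
        for skip in range(k + 1):
            j = i + skip + 1
            if j >= len(sequence):
                break
            pair = sequence[i] + sequence[j]
            if pair in counts:  # ignore pairs with non-standard residues, e.g. X
                counts[pair] += 1
                total += 1
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]


features = k_skip_bigram_features("ACDA", k=1)
print(len(features))
```

A fixed-length vector like this can be passed to any standard classifier, which is what makes it possible to compare or combine it with other feature representations.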

Highlights

  • DNA-binding proteins are a significant component of living organisms, in both prokaryotic and eukaryotic proteomes, for example in plant mitochondria [1] and the human body

  • The training set has protein sequences with equal numbers of DNA-binding and non-DNA-binding types, whereas the test set has protein sequences with equal numbers of DNA-binding and non-DNA-binding types

  • Given the limitations of the three feature representation methods, this paper considers the mixed feature representation methods to ensure that each new feature vector contains various features


Summary

Introduction

DNA-binding proteins are a significant component of living organisms, in both prokaryotic and eukaryotic proteomes, for example in plant mitochondria [1] and the human body. Experimental techniques such as filter-binding assays, genomic analysis, micro-matrix, and chromosomal immunoprecipitation reactions [8] can provide detailed information on DNA-binding proteins. In the last few decades, many machine-learning methods have been developed for predicting DNA-binding proteins. These methods are divided into two types, namely, sequence-based and structure-based methods. In sequence-based methods, the features are extracted from sequence information, such as amino acid composition and amino acid amount, without considering any structural information [10]. These methods are highly efficient and useful for prediction on large-scale protein sequence datasets [8]. Structure-based feature representation methods use structural and sequence information to identify proteins [10].

Molecules 2017, 22, 1602
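
As an illustration of a sequence-based feature, amino acid composition can be computed directly from the sequence string, with no structural information. The sketch below is a minimal example; the function name `amino_acid_composition` and the choice to skip non-standard residue codes are assumptions made here for illustration, not details taken from the paper.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Residues outside the 20 standard codes (e.g. X) are skipped, which
    is an assumption of this sketch.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


composition = amino_acid_composition("AACD")
print(composition[:3])  # fractions for A, C, D
```

Because the result always has 20 entries regardless of sequence length, it scales easily to large sequence datasets, which matches the efficiency argument made for sequence-based methods.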

Overview
Dataset
Classifier
Single Feature Representation Methods
Mixed Feature Representation Methods and Feature Selection
Measurement
Method
Performance of the Mixed Features
Comparison with State-of-the-Art Methods
Comparison with Other Classifiers
Findings
Conclusions
