Abstract

DNA-binding proteins play a crucial role in various genomic processes, including identification of specific nucleotides, regulation of transcription, and regulation of gene expression. Various conventional methods have been used to identify DNA-binding proteins. However, owing to the rapid growth of protein sequences in databases, identifying DNA-binding proteins with these methods is difficult and sometimes impossible. An automated model for identifying DNA-binding proteins is therefore highly desirable. In the proposed model, numerical features are extracted using dipeptide composition, split amino acid composition, and the position-specific scoring matrix (PSSM). To address class bias and reduce the true error, the SMOTE oversampling technique was applied to balance the datasets. Several classifiers are used: K-nearest neighbor, probabilistic neural network, support vector machine (SVM), and random forest. Their performance is assessed on two benchmark datasets using the jackknife test. Among these classifiers, SVM combined with the PSSM feature space achieved the highest success rates: 92.3% accuracy on dataset1 and 88.5% on dataset2. These empirical results are the highest reported in the literature so far. We anticipate that the proposed model may be useful and provide a foundation for the research and academic communities.
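
To make two of the steps named above concrete, the following is a minimal Python sketch of dipeptide composition feature extraction and a jackknife (leave-one-out) evaluation. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a plain 1-nearest-neighbor rule stands in for the classifiers in the paper, and the toy feature vectors in the usage note are made up.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered pairs (dipeptides).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide frequencies.

    Overlapping adjacent residue pairs are counted; pairs containing a
    non-standard residue are skipped (an assumption of this sketch).
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def jackknife_accuracy(features, labels):
    """Jackknife test: hold out each sample once, predict it from the rest.

    A 1-nearest-neighbor rule is used here only as a simple stand-in
    classifier; it is not the SVM model reported in the abstract.
    """
    correct = 0
    for i, held_out in enumerate(features):
        # Index of the closest sample among all samples except the held-out one.
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: squared_distance(held_out, features[j]),
        )
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(labels)
```

For example, `dipeptide_composition("ACAC")` counts the pairs AC, CA, and AC, so the AC entry is 2/3 and the CA entry is 1/3. Calling `jackknife_accuracy` on a list of such vectors with their class labels returns the fraction of held-out samples that were predicted correctly.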
