Abstract
A DNA-binding protein (DBP) is a protein with a special DNA-binding domain and is associated with many important molecular biological mechanisms. The rapid development of computational methods has made it possible to predict DBPs on a large scale; however, existing methods do not fully integrate DBP-related features, resulting in imprecise predictions. In this article, we develop a DNA-binding protein identification method called KK-DBP. To improve prediction accuracy, we propose a feature extraction method that fuses multiple PSSM features. The experimental results show a prediction accuracy of 81.22% on the independent test dataset PDB186, which is the highest of all existing methods.
Highlights
Proteins are spatially structured substances formed when amino acids are linked into polypeptide chains through dehydration condensation and the chains then fold into complex structures
We selected four performance measures, accuracy (ACC), specificity (SP), sensitivity (SN) and the Matthews correlation coefficient (MCC), to evaluate the predictive ability of the model used in this study (Wei et al., 2014; Wei et al., 2017b; Manavalan et al., 2019a; Manavalan et al., 2019b; Jin et al., 2019; Su et al., 2019; Li et al., 2020a; Liu et al., 2020a; Ao et al., 2020; Li et al., 2020b; Zhang et al., 2020b; Yu et al., 2020; Zhao et al., 2020; Wang et al., 2021c; Zhu et al., 2021)
Performance of Different Features on Training Set PDB1075: A large amount of information on homologous proteins is contained in evolutionary features based on the position-specific scoring matrix (PSSM)
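The four performance measures named in the highlights (ACC, SP, SN and MCC) can be illustrated with a short sketch. It uses the standard confusion-matrix definitions, which this page does not spell out, so treat it as an assumption about how the measures are computed rather than the authors' evaluation code. The function name `binary_metrics` and the label convention (1 = DBP, 0 = non-DBP) are hypothetical choices made for this example.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return ACC, SN, SP and MCC for binary labels (assumed: 1 = DBP, 0 = non-DBP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall on DBPs)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity (recall on non-DBPs)

    # MCC is undefined when any marginal count is zero; return 0.0 in that case.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Toy labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
print(scores)  # {'ACC': 0.75, 'SN': 0.75, 'SP': 0.75, 'MCC': 0.5}
```

MCC is reported alongside accuracy because it uses all four confusion-matrix counts, so it stays informative when the positive and negative classes are unbalanced.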
Summary
Proteins are spatially structured substances formed when amino acids are linked into polypeptide chains through dehydration condensation and the chains then fold into complex structures. iDNAPro-PseAAC (Liu et al., 2015), which uses a similar feature extraction method, adopts a prediction model based on a support vector machine to predict DBPs. A number of DNA-binding protein prediction methods based on different strategies exist. Most of these DBP prediction methods fail to extract features based on evolutionary information, so their robustness and prediction accuracy have much room for improvement. Given a protein sequence, BLAST can represent the evolutionary information of the protein by aligning it against a specific database and extracting a position-specific scoring matrix (PSSM). Because each protein sequence in the dataset is represented by the pseudo-composition of all of its dipeptides, we can generate a 110-dimensional RPSSM feature vector, defined as follows: