Abstract

This paper describes a new machine learning approach for predicting DNA-binding residues from protein sequence data. Several biologically relevant features, including biochemical properties of amino acid residues and evolutionary information from protein sequences, were selected for input encoding. The evolutionary information was represented as position-specific scoring matrices (PSSMs) and several new descriptors developed in this study. The sequence-derived features were then used to train random forests (RFs), which can handle a large number of input variables and avoid model overfitting. The use of evolutionary information together with biochemical features was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset. The results suggest that the RF-based approach predicts DNA-binding residues more accurately than the methods reported in previous studies.

Keywords

DNA-binding site prediction; feature extraction; evolutionary information; random forests; machine learning
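The input encoding summarized above can be illustrated with a minimal sketch. The abstract does not specify the encoding details, so everything here is an assumption made for illustration: the function name `encode_residues`, the sliding window of neighbouring residues, the zero padding at the sequence termini, and the use of the Kyte-Doolittle hydropathy index as a single stand-in for the biochemical features. The PSSM is supplied by the caller as one row of scores per residue.

```python
from typing import List, Sequence

# Kyte-Doolittle hydropathy index. Used only as an example biochemical
# property (assumption); the abstract does not list the paper's features.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_residues(
    sequence: str,
    pssm: Sequence[Sequence[float]],
    window: int = 5,  # hypothetical window size, not taken from the paper
) -> List[List[float]]:
    """Build one feature vector per residue from a window centred on it.

    Each window position contributes its hydropathy value followed by its
    PSSM row. Positions that fall outside the sequence are zero-padded.
    """
    if len(pssm) != len(sequence):
        raise ValueError("PSSM must have one row per residue")
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")

    half = window // 2
    n_scores = len(pssm[0]) if pssm else 0
    vectors: List[List[float]] = []

    for center in range(len(sequence)):
        features: List[float] = []
        for pos in range(center - half, center + half + 1):
            if 0 <= pos < len(sequence):
                # Unknown residue letters fall back to a neutral 0.0.
                features.append(HYDROPATHY.get(sequence[pos], 0.0))
                features.extend(float(score) for score in pssm[pos])
            else:
                # Zero padding beyond the N- or C-terminus.
                features.append(0.0)
                features.extend([0.0] * n_scores)
        vectors.append(features)

    return vectors
```

Each resulting vector, paired with a binding or non-binding label for its central residue, could then serve as one training instance for a random forest implementation such as scikit-learn's `RandomForestClassifier`. That pairing is likewise an assumption about the workflow rather than a detail stated in the abstract.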
