Abstract
Disordered regions, i.e., regions of proteins that do not adopt a stable three-dimensional structure, have been shown to play various and critical roles in many biological processes. Predicting and understanding their formation is therefore a key sub-problem of protein structure and function inference. A wide range of machine learning approaches have been developed to automatically predict disordered regions of proteins. One key factor in the success of these methods is the way in which protein information is encoded into features. Recently, we proposed a systematic methodology to study the relevance of various feature encodings in the context of disulfide connectivity pattern prediction. In the present paper, we adapt this methodology to the problem of predicting disordered regions and assess it on proteins from the 10th CASP competition, as well as on a very large subset of proteins extracted from PDB. Our results, obtained with ensembles of extremely randomized trees, highlight a novel feature function encoding the proximity of residues according to their accessibility to the solvent, which plays the second most important role in the prediction of disordered regions, just after evolutionary information. Furthermore, even though our approach treats each residue independently, our results are very competitive in terms of accuracy with respect to the state of the art. A web application is available at http://m24.giga.ulg.ac.be:81/x3Disorder.
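To make the idea of encoding protein information into per-residue features concrete, the following is a minimal sketch of one common encoding: a one-hot sliding window over the primary sequence. It is an illustration only, not the feature functions studied in the paper; the names `one_hot`, `window_features`, `encode_protein` and the `half_width` parameter are hypothetical. Each residue gets its own feature vector, which is what allows a classifier to treat residues independently.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue: str) -> List[int]:
    """One-hot vector over the 20 standard amino acids (all zeros if unknown)."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def window_features(sequence: str, position: int, half_width: int = 2) -> List[int]:
    """Concatenate one-hot vectors for the residues in a window centred on `position`.

    Window slots that fall outside the sequence are encoded as all-zero padding.
    This windowing scheme is an assumption made for illustration.
    """
    features: List[int] = []
    for offset in range(-half_width, half_width + 1):
        idx = position + offset
        if 0 <= idx < len(sequence):
            features.extend(one_hot(sequence[idx]))
        else:
            features.extend([0] * len(AMINO_ACIDS))
    return features


def encode_protein(sequence: str, half_width: int = 2) -> List[List[int]]:
    """Build one independent feature vector per residue."""
    return [window_features(sequence, i, half_width) for i in range(len(sequence))]
```

Each row returned by `encode_protein` has `(2 * half_width + 1) * 20` entries and could be passed, together with a per-residue ordered/disordered label, to any tree-ensemble classifier.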
Highlights
Disordered regions refer to regions in proteins that do not adopt a stable three-dimensional structure when they are not in the presence of their partner molecules.
Several automatic methodologies have been proposed to predict disordered regions from primary sequences. They range from simple methods based on sequence complexity [5] to more sophisticated machine learning approaches, often relying on neural networks or Support Vector Machines (SVMs) [6,7,8,9,10].
The first part presents the results of the main contribution of this paper, which aims at determining a relevant representation on Disorder723.
Summary
Disordered regions refer to regions in proteins that do not adopt a stable three-dimensional structure when they are not in the presence of their partner molecules. Several experimental studies have shown that proteins with disordered regions play various and critical functions in many biological processes. The flexibility of these regions makes it possible for a protein to interact with, recognize and bind to many partners. Several automatic methodologies have been proposed to predict disordered regions from primary sequences. They range from simple methods based on sequence complexity [5] to more sophisticated machine learning approaches, often relying on neural networks or Support Vector Machines (SVMs) [6,7,8,9,10]. For more information about disordered region predictors, one can refer to the reports of these assessments [17] or to the recent comprehensive overview of computational protein disorder prediction methods by Deng et al. [18].
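As an illustration of what a simple sequence-complexity measure can look like, the sketch below computes the Shannon entropy of the amino acid composition in a sliding window. This is a generic complexity measure chosen for illustration, not the specific method of [5]; the function names and the `window` parameter are hypothetical. Low values flag compositionally repetitive stretches, which such simple methods treat as candidate disordered regions.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of `segment`."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def windowed_complexity(sequence: str, window: int = 5) -> List[float]:
    """Entropy of a window centred on each residue, truncated at the sequence ends."""
    half = window // 2
    scores: List[float] = []
    for i in range(len(sequence)):
        start = max(0, i - half)
        end = min(len(sequence), i + half + 1)
        scores.append(shannon_entropy(sequence[start:end]))
    return scores
```

A per-residue prediction could then be obtained by thresholding these scores; the threshold itself would have to be tuned on annotated data and is deliberately left out here.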