Abstract

Proteins are molecular machines that play a fundamental role in almost every activity of life. Their biological functions are driven mostly by conformational transitions and by interaction interfaces with other biomolecules such as DNA, other proteins, and ligands. To investigate the mechanisms underlying protein function, I conducted two projects: the first explores structural change in proteins by identifying their rigid bodies, and the second devises new sequence-based features for predicting DNA-binding sites in proteins. Despite many previous efforts to compute rigid domains in proteins, there is still a need for segmentation algorithms that can process large numbers of proteins efficiently while avoiding protein-dependent parameter tuning, such as specifying the number of rigid domains. I therefore introduce a new rigid domain segmentation method in which multiple conformational states of a protein are represented by a graph whose vertices are amino acids. This graph is then reduced by coarse graining, for example with the Louvain clustering algorithm. Next, the domain-wise relationships among clusters in the reduced graph are inferred through a binary labeling of its edges, which becomes feasible thanks to the line graph transformation and a generalized Viterbi algorithm. Because of this binary labeling, the method does not require the number of rigid domains as an input parameter, unlike other existing methods. I validate the graph-based method on 487 examples from the DynDom database and compare its segmentations with those of other methods on several proteins whose structural changes range from medium to large and whose molecular motions have been studied extensively in the literature. The code and usage instructions are available at https://github.com/dtklinh/GBRDE.
In the second project, I address the identification of DNA-binding sites in proteins, which can be approached through either structure-based or sequence-based methods. Although structure-based methods achieve good results, they require protein 3D structures, which are expensive and time-consuming to determine. Sequence-based methods, in contrast, can be applied efficiently to entire protein databases but demand carefully designed features. I therefore present a new information-theoretic feature derived from the Jensen–Shannon divergence (JSD), which captures the differences between the amino acid distributions of binding and non-binding sites. For the evaluation, I ran five-fold cross-validation on 263 proteins using a Random Forest (RF) classifier with features comprising the new sequence-based feature and several popular ones such as the position-specific scoring matrix (PSSM), orthogonal binary vector (OBV), and secondary structure (SS). The results show that concatenating the new feature with the existing ones significantly improves RF classifier performance in terms of sensitivity and Matthews correlation coefficient (MCC).
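
As a rough illustration of the quantity behind the JSD-based feature, the following sketch computes the Jensen–Shannon divergence between two amino acid frequency distributions. It is not the feature construction used in this work, which the abstract does not specify. The function names, the pseudocount, and the toy residue strings are all hypothetical choices made for this example.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def distribution(residues, pseudocount=1.0):
    """Estimate an amino acid frequency distribution from a residue string.

    The pseudocount (an assumption of this sketch) keeps every probability
    strictly positive, even for residues that never occur in the sample.
    """
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[a] + pseudocount) / total for a in AMINO_ACIDS]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), where m = (p + q) / 2.

    With base-2 logarithms the value lies between 0 and 1.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical toy data: residues observed at binding and non-binding sites.
    binding = distribution("KRKRRKSG")
    non_binding = distribution("ALVLIAGE")
    print(jensen_shannon_divergence(binding, non_binding))
```

A larger value indicates that the two residue distributions differ more strongly. The divergence is symmetric in its arguments and equals zero when the distributions are identical.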
