Natural language processing in text mining for structural modeling of protein complexes

Varsha D Badal, Ilya A Vakser, Petras J Kundrotas

doi:10.1186/s12859-018-2079-4

Open Access

https://doi.org/10.1186/s12859-018-2079-4

Journal: BMC Bioinformatics
Publication Date: Mar 5, 2018
Citations: 29
License type: open-access

Affiliation: University of Kansas

Abstract

Background: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for the docking. Our text-mining (TM) tool, which extracts binding site residues from the PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many extracted residues were not relevant to the docking.

Results: We present an extension of the TM tool, which utilizes natural language processing (NLP) for analyzing the context of the residue occurrence. The procedure was tested using generic and specialized dictionaries. The results showed that the keyword dictionaries designed for identification of protein interactions are not adequate for the TM prediction of the binding mode. However, our dictionary designed to distinguish keywords relevant to the protein binding sites led to considerable improvement in the TM performance. We investigated the utility of several methods of context analysis, based on dissection of the sentence parse trees. The machine learning-based NLP filtered the pool of the mined residues significantly more efficiently than the rule-based NLP. Constraints generated by NLP were tested in docking of unbound proteins from the DOCKGROUND X-ray benchmark set 4. The output of the global low-resolution docking scan was post-processed, separately, by constraints from the basic TM, constraints re-ranked by NLP, and the reference constraints. The quality of a match was assessed by the interface root-mean-square deviation. The results showed significant improvement of the docking output when using the constraints generated by the advanced TM with NLP.

Conclusions: The basic TM procedure for extracting protein-protein binding site residues from the PubMed abstracts was significantly advanced by the deep parsing (NLP techniques for contextual analysis) in purging of the initial pool of the extracted residues. Benchmarking showed a substantial increase of the docking success rate based on the constraints generated by the advanced TM with NLP.
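To make the idea of mining residues and then judging their context more concrete, the following is a minimal sketch and not the authors' tool. It assumes residues are written as a three-letter amino acid code followed by a number, and it uses a small hypothetical keyword set (`BINDING_KEYWORDS`) in place of the paper's dictionaries. It checks keywords at the sentence level only and does not attempt the parse-tree analysis described above.

```python
import re

# Three-letter codes of the 20 standard amino acids.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Matches mentions such as "Arg45", "Tyr-102" or "Lys 33" (assumed notation).
RESIDUE_RE = re.compile(r"\b(" + AA3 + r")[- ]?(\d{1,4})\b")

# Hypothetical keyword dictionary; the paper's dictionaries are not reproduced here.
BINDING_KEYWORDS = {
    "bind", "binds", "binding", "interface",
    "interacts", "interaction", "contact", "contacts",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mine_residues(abstract):
    """Return (residue_name, residue_number, in_binding_context) tuples.

    A residue counts as being in a binding context when its sentence
    contains at least one word from BINDING_KEYWORDS.
    """
    hits = []
    for sentence in split_sentences(abstract):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        relevant = bool(words & BINDING_KEYWORDS)
        for match in RESIDUE_RE.finditer(sentence):
            hits.append((match.group(1), int(match.group(2)), relevant))
    return hits


example = ("Mutation of Arg45 and Tyr-102 abolished binding to the receptor. "
           "Lys 33 was phosphorylated in vitro.")
for name, number, relevant in mine_residues(example):
    print(name, number, "binding context" if relevant else "no binding keyword")
```

In this sketch, residues flagged without a binding keyword would be candidates for removal from the pool of docking constraints. A context analysis based on sentence structure, as the abstract describes, would replace the simple keyword test.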

Highlights

Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes.
We present an advancement of our basic TM procedure based on the deep parsing (NLP techniques for contextual analysis of the abstract sentences) for purging of the initial pool of the extracted residues.
The TM procedure was tested on 579 protein-protein complexes from the DOCKGROUND resource [38].

Summary

Introduction

Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Our text-mining (TM) tool, which extracts binding site residues from the PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Due to the limitations of the experimental techniques, most structures have to be modeled by either free or template-based docking [1]. Both docking paradigms produce a large pool of putative models, and selecting the correct one is a non-trivial task, performed by scoring procedures [2]. Automated text mining (TM) tools utilizing online availability of indexed scientific literature (e.g. PubMed, https://www.ncbi.nlm.nih.gov/pubmed) are becoming increasingly important, employing Natural Language Processing (NLP) algorithms to purge non-relevant information from the initial
