Abstract
Background: We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to associate each mention with the correct protein also named in the text. The methods presented in this work will enable improved extraction of protein functional sites from articles, ultimately supporting protein function prediction. Our method uses linguistic patterns to identify amino acid residue mentions in text. Further, we apply an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text. Finally, we present an approach to the automated construction of relevant training and test data using the distant supervision model.

Results: The performance of the method was assessed by extracting protein-residue relations from a new, automatically generated test set of sentences containing high-confidence examples found using distant supervision. It achieved an F-measure of 0.84 on the automatically created silver corpus and 0.79 on a manually annotated gold data set for this task, outperforming previous methods.

Conclusions: The primary contributions of this work are to (1) demonstrate the effectiveness of distant supervision for automatic creation of training data for protein-residue relation extraction, substantially reducing the effort and time involved in manual annotation of a data set, and (2) show that the graph-based relation extraction approach we used generalizes well to the problem of protein-residue association extraction. This work paves the way towards effective extraction of protein functional residues from the literature.
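The distant supervision idea mentioned above can be illustrated with a minimal sketch. The abstract does not say which knowledge source or matching rule was used, so everything here is an assumption made for illustration: `KNOWN_SITES` is a hypothetical hand-written lookup of residues already recorded for a protein, and `label_candidates` is a hypothetical helper that marks a co-occurring protein-residue pair as a positive example when the lookup contains it.

```python
from itertools import product

# Hypothetical knowledge base (an assumption, not from the paper):
# lower-cased protein name -> residues already recorded for that protein.
KNOWN_SITES = {
    "trypsin": {("Ser", 195), ("His", 57)},
}


def label_candidates(proteins, residues, known_sites=KNOWN_SITES):
    """Pair every protein with every residue found in one sentence.

    A pair is labelled True when the knowledge base lists that residue
    for that protein, and False otherwise. The labels are noisy by
    design: no human checks the individual sentences.
    """
    labelled = []
    for protein, residue in product(proteins, residues):
        is_known = residue in known_sites.get(protein.lower(), set())
        labelled.append((protein, residue, is_known))
    return labelled
```

Calling `label_candidates(["Trypsin", "lysozyme"], [("Ser", 195)])` would label the trypsin pair as positive and the lysozyme pair as negative. Such labels could then serve as training or test data without manual annotation, which is the saving the abstract attributes to distant supervision.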
Highlights
We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature.
Through this work we have demonstrated that a subgraph matching-based relation extraction approach generalizes well to the problem of extracting protein-residue associations.
The task itself has broader significance for protein function prediction and subsequent drug discovery, given the context of our ongoing research into integrating evidence extracted from the biomedical literature into a protein function prediction system [2,3].
Summary
We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to correctly associate each mention with a protein named in the text. Our method uses linguistic patterns to identify amino acid residue mentions in text. We applied an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text.

In the context of three-dimensional protein structures, the appearance of certain amino acid residues at key structural positions plays a central role in protein function, for instance by enabling ligand or substrate binding. Efforts to manually catalog functional sites mentioned in the literature help, but they will not close the gap in the near future given the pace at which the biomedical literature grows. The overarching goal of our work is to identify such functional sites automatically from the biomedical literature.
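To make the residue-mention step concrete, the sketch below shows one simple way such a linguistic pattern could look. It is not the authors' pattern set, which the text above does not specify. It assumes mentions take the form of a three-letter code or full amino acid name followed by a position, such as "Ser195", "Asp-102" or "histidine 57". The names `RESIDUE_RE` and `find_residue_mentions` are hypothetical.

```python
import re

# Standard amino acids: three-letter code -> full name.
AMINO_ACIDS = {
    "Ala": "alanine", "Arg": "arginine", "Asn": "asparagine",
    "Asp": "aspartate", "Cys": "cysteine", "Gln": "glutamine",
    "Glu": "glutamate", "Gly": "glycine", "His": "histidine",
    "Ile": "isoleucine", "Leu": "leucine", "Lys": "lysine",
    "Met": "methionine", "Phe": "phenylalanine", "Pro": "proline",
    "Ser": "serine", "Thr": "threonine", "Trp": "tryptophan",
    "Tyr": "tyrosine", "Val": "valine",
}
FULL_TO_THREE = {full: three for three, full in AMINO_ACIDS.items()}

# Illustrative pattern: residue name, optional space or hyphen, position.
_names = "|".join(list(AMINO_ACIDS) + list(FULL_TO_THREE))
RESIDUE_RE = re.compile(rf"\b({_names})[\s-]?(\d+)\b", re.IGNORECASE)


def find_residue_mentions(text):
    """Return (three_letter_code, position) tuples in order of appearance."""
    mentions = []
    for match in RESIDUE_RE.finditer(text):
        name = match.group(1).lower()
        # Full names map to their code; three-letter codes are re-capitalized.
        three = FULL_TO_THREE.get(name, name.capitalize())
        mentions.append((three, int(match.group(2))))
    return mentions
```

A pattern this simple misses single-letter forms such as "S195" and names such as "aspartic acid", and it can produce false positives (for example the English word "his" followed by a number). A practical system would need a broader and more carefully filtered pattern set.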