Abstract

Background: The exploitation of information extraction (IE), a technology aiming to provide instances of structured representations from free-form text, has been rapidly growing within the molecular biology (MB) research community to keep track of the latest results reported in the literature. IE systems have traditionally used shallow syntactic patterns for matching facts in sentences, but such approaches appear inadequate to achieve high accuracy in MB event extraction due to complex sentence structure. A consensus is emerging in the IE community on the necessity of exploiting deeper knowledge structures, such as the relations between a verb and its arguments shown by predicate-argument structure (PAS). PAS is of interest because these structures typically correspond to events of interest and their participating entities. For this to be realized within IE, a key knowledge component is the definition of PAS frames. PAS frames for non-technical domains such as newswire are already being constructed in several projects such as PropBank, VerbNet, and FrameNet. Knowledge from PAS should enable more accurate applications in several areas where sentence understanding is required, such as machine translation and text summarization. In this article, we explore the need to adapt PAS for the MB domain and specify PAS frames to support IE, and we outline the major issues that require consideration in their construction.

Results: We introduce PASBio by extending a model based on PropBank to the MB domain. The hypothesis we explore is that PAS holds the key for understanding relationships describing the roles of genes and gene products in mediating their biological functions. We chose predicates describing gene expression, molecular interactions and signal transduction events with the aim of covering a number of research areas in MB. Analysis was performed on sentences containing a set of verbal predicates from MEDLINE and full-text journals. Results confirm the necessity of analyzing PAS specifically for the MB domain.

Conclusions: At present PASBio contains the analyzed PAS of over 30 verbs, publicly available on the Internet for use in advanced applications. In the future we aim to expand the knowledge base to cover more verbs and the nominal form of each predicate.
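
To make the idea of a PAS frame concrete, the following is a minimal Python sketch of how a PropBank-style frame (a predicate with numbered argument roles) and one filled instance of it could be represented. It is an illustration only and does not reproduce PASBio's actual format: the class names `Frame` and `Event`, the role descriptions for "activate", and the entity names `ProteinA` and `GeneB` are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Frame:
    """A PropBank-style frame: one predicate and its numbered argument roles."""
    predicate: str
    roles: Dict[str, str]  # role label (e.g. "Arg0") -> description of the role


@dataclass
class Event:
    """One instance of a frame, filled with text spans from a sentence."""
    frame: Frame
    arguments: Dict[str, str] = field(default_factory=dict)

    def missing_roles(self) -> List[str]:
        """Return the frame roles that have no filler in this instance."""
        return [role for role in self.frame.roles if role not in self.arguments]


# Hypothetical frame for illustration; the role descriptions are assumptions,
# not entries taken from PASBio or PropBank.
ACTIVATE = Frame(
    predicate="activate",
    roles={"Arg0": "entity causing the activation", "Arg1": "entity being activated"},
)

# Placeholder sentence: "ProteinA activates GeneB."
event = Event(frame=ACTIVATE, arguments={"Arg0": "ProteinA", "Arg1": "GeneB"})
print(event.frame.predicate, event.arguments, event.missing_roles())
```

Separating the frame definition from its filled instances mirrors the distinction drawn above between the knowledge component (the PAS frames) and the events an IE system extracts from individual sentences.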

Highlights

  • The exploitation of information extraction (IE), a technology aiming to provide instances of structured representations from free-form text, has been rapidly growing within the molecular biology (MB) research community to keep track of the latest results reported in the literature

  • Traditional IE systems that use regular expressions based on shallow chunking at the phrase level capture weak notions of 'argument' for event predicates and their linear precedence

  • There are three important cases that we examine in comparison to PropBank: (1) verbs that are rarely used in general language or have a unique biological interpretation, (2) verbs that have a similar meaning used in the newswire domain and biology domain but show different patterns of usage, and (3) verbs that are used with the same meaning and

Summary

Introduction

The exploitation of information extraction (IE), a technology aiming to provide instances of structured representations from free-form text, has been rapidly growing within the molecular biology (MB) research community to keep track of the latest results reported in the literature. We are in an era where full genomes, data from high-throughput experimental methods (e.g. micro-arrays) and electronic versions of scientific literature are available to every researcher over the Internet. These advances have made it possible to work on more than one gene at a time, ask complex questions and increase the pace of biological discovery. The large-scale data generated from experiments in molecular biology needs to be assessed and integrated into the scientific communities' knowledge stores. This has created a need for various kinds of specialized databases. This is where information extraction (IE) has an important role to play.

