Abstract

To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence-based approaches can process large text corpora such as MEDLINE in an acceptable time but cannot extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive, and the generated trees are difficult to interpret. Several natural language processing (NLP) approaches for the biomedical domain exist that focus specifically on the detection of a limited set of relation types. Systems biology needs generic approaches that detect a multitude of relation types and can also process large text corpora, but the number of systems meeting both requirements is very limited. We introduce the use of SENNA (“Semantic Extraction using a Neural Network Architecture”), a fast and accurate neural-network-based Semantic Role Labeling (SRL) program, for the large-scale extraction of semantic relations from the biomedical literature. A comparison of the processing times of SENNA and other SRL systems or syntactic parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. A total of 89 million biomedical sentences were tagged with SENNA on a 100-node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences, resulting in precision/recall values of 0.71/0.43. We show that the accuracy as well as the processing speed of the proposed semantic relation extraction approach is sufficient for its large-scale application to biomedical text. The proposed approach is highly generalizable regarding the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems.
The presented approach bridges the gap between fast, co-occurrence-based approaches that lack semantic relations and highly specialized, computationally demanding NLP approaches.

Highlights

  • The rapidly increasing amount of biomedical publications is a key resource for the automated extraction and inference of relations between biomedical concepts such as protein-protein interactions or regulatory interrelations

  • We presented the use of a novel approach based on SENNA, a Semantic Role Labeling (SRL) program, for fast and reliable semantic role labeling of biomedical text corpora

  • An example could be the extraction of protein transport relations mentioned within GeneRIFs, a set of sentences in the Entrez Gene database describing the function of a gene, where 85% of the protein transport predicates were reported to be used as nouns [22]

Summary

Introduction

The rapidly increasing amount of biomedical publications is a key resource for the automated extraction and inference of relations between biomedical concepts such as protein-protein interactions or regulatory interrelations. SENNA [14,15], a semantic role labeling program trained on the PropBank corpus, does not rely on the extraction of syntax trees for assigning semantic roles to sentence constituents. Instead, it takes a radically different approach from existing SRL programs: skipping the step of syntax tree generation, SENNA’s neural network architecture was trained directly on some basic, quickly derivable sentence features. To assess the applicability of SRL for extracting relations between biomedical entities, we examined how often the simplifying assumption holds true that all entities in the ARG0/ARG1 parts generated by an SRL program act as actor/target in the sense of the verb. This question is of crucial importance for assessing whether the proposed SRL-based approach is reliable enough to build a large-scale biomedical text mining system. We chose SENNA for the evaluation of SRL-based relation extraction (RE), applied SENNA to almost 90 million MEDLINE sentences, and compared its speed with that of syntactic parsers commonly used for relation extraction in the biological domain.
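The simplifying assumption described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce SENNA's actual output format: the frame dictionary, the entity lexicon, and the function names `entities_in` and `extract_relations` are hypothetical and chosen only for illustration. The sketch treats every known entity in the ARG0 span as an actor and every known entity in the ARG1 span as a target of the predicate.

```python
from itertools import product


def entities_in(span, lexicon):
    """Return known entity names found among the tokens of an argument span."""
    # Hypothetical, very simple matching: single tokens, punctuation stripped.
    tokens = [tok.strip(".,;:()") for tok in span.split()]
    return [tok for tok in tokens if tok in lexicon]


def extract_relations(frame, lexicon):
    """Pair every entity in ARG0 (actor) with every entity in ARG1 (target)."""
    actors = entities_in(frame.get("ARG0", ""), lexicon)
    targets = entities_in(frame.get("ARG1", ""), lexicon)
    return [(a, frame["predicate"], t) for a, t in product(actors, targets)]


# Assumed, hand-written SRL frame for the sentence
# "The kinase AKT1 phosphorylates GSK3B and FOXO1."
frame = {
    "predicate": "phosphorylates",
    "ARG0": "The kinase AKT1",
    "ARG1": "GSK3B and FOXO1",
}
lexicon = {"AKT1", "GSK3B", "FOXO1"}

relations = extract_relations(frame, lexicon)
for actor, predicate, target in relations:
    print(actor, predicate, target)
```

Under this assumption, a frame with one entity in ARG0 and two in ARG1 yields two actor/target pairs. How often such pairs are correct is exactly the question the evaluation addresses.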
