Benchmarking natural-language parsers for biological applications using dependency graphs.

Andrew B Clegg,Adrian J Shepherd

doi:10.1186/1471-2105-8-24

Abstract

Background: Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy using an intermediate representation based on dependency graphs, in which the semantic relationships important in most information extraction tasks are closer to the surface. We also demonstrate how this method can be easily tailored to various application-driven criteria.

Results: Using the GENIA corpus as a gold standard, we tested four open-source parsers which have been used in bioinformatics projects. We first present overall performance measures, and test the two leading tools, the Charniak-Lease and Bikel parsers, on subtasks tailored to reflect the requirements of a system for extracting gene expression relationships. These two tools clearly outperform the other parsers in the evaluation, and achieve accuracy levels comparable to or exceeding native dependency parsers on similar tasks in previous biological evaluations.

Conclusion: Evaluating using dependency graphs allows parsers to be tested easily on criteria chosen according to the semantics of particular biological applications, drawing attention to important mistakes and soaking up many insignificant differences that would otherwise be reported as errors. Generating high-accuracy dependency graphs from the output of phrase-structure parsers also provides access to the more detailed syntax trees that are used in several natural-language processing techniques.

Highlights

Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors
The software packages used in our evaluation are the Bikel parser [15], the Collins parser [16], the Stanford parser [17,18] and the Charniak parser [19] – including a modified version known as the Charniak-Lease parser [20]. All of these are widely used by the computational linguistics community, and have been employed to parse molecular biology data, despite having been developed and trained on sentences from the Penn Treebank
Our gold standard corpus was 1757 sentences from the GENIA treebank [13], which were mapped from their original tree structures to dependency graphs by the same deterministic algorithm from the Stanford toolkit that we used to convert the output of each parser [9]

Summary

Introduction

Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. Although much headway has been made using text processing methods based on linear pattern matching (e.g. regular expressions), the diversity and complexity of natural language have caused many researchers to integrate more sophisticated parsing methods into their biological NLP pipelines [6,7]. This enables NLP systems to take into account the grammatical content of each sentence, including deeply nested structures, and dependencies between widely separated words or phrases that are hard to capture with superficial patterns. Dependency parsers, by contrast, produce a graph for each sentence, where each node represents a word, and each arc a grammatical dependency such as that which holds between a verb and its subject (see Figure 2).
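
As a rough illustration of the representation described above, a dependency graph can be held as a set of (head, dependent, relation) arcs and compared against a gold-standard graph for the same sentence. The sketch below is a minimal example under stated assumptions, not the scoring procedure used in the paper: the `Dependency` type, the `score` function, the three-word sentence, and the choice of precision, recall and F-measure over arcs are all hypothetical. The labels `nsubj` and `dobj` are Stanford typed-dependency relation names.

```python
from typing import Dict, NamedTuple, Set


class Dependency(NamedTuple):
    """One arc of a dependency graph (hypothetical type for illustration)."""
    head: int        # 1-based index of the governing word
    dependent: int   # 1-based index of the dependent word
    relation: str    # grammatical relation label, e.g. "nsubj"


def score(gold: Set[Dependency], predicted: Set[Dependency],
          labelled: bool = True) -> Dict[str, float]:
    """Compare predicted arcs with gold arcs for a single sentence.

    With labelled=False the relation label is ignored, so an arc counts
    as correct when it links the right pair of words.
    """
    if labelled:
        gold_arcs = set(gold)
        pred_arcs = set(predicted)
    else:
        gold_arcs = {(d.head, d.dependent) for d in gold}
        pred_arcs = {(d.head, d.dependent) for d in predicted}

    correct = len(gold_arcs & pred_arcs)
    precision = correct / len(pred_arcs) if pred_arcs else 0.0
    recall = correct / len(gold_arcs) if gold_arcs else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Hypothetical sentence: "IL-2 activates STAT5" (tokens numbered 1..3).
gold = {
    Dependency(head=2, dependent=1, relation="nsubj"),  # activates -> IL-2
    Dependency(head=2, dependent=3, relation="dobj"),   # activates -> STAT5
}
# Hypothetical parser output: right attachments, one wrong label.
predicted = {
    Dependency(head=2, dependent=1, relation="nsubj"),
    Dependency(head=2, dependent=3, relation="dep"),
}

print(score(gold, predicted, labelled=True))
print(score(gold, predicted, labelled=False))
```

Scoring with and without labels separates attachment mistakes from labelling mistakes, which is one simple way an evaluation over arcs could be adjusted to the needs of a particular application.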

Journal: BMC bioinformatics	Publication Date: Jan 25, 2007
Citations: 135	License type: CC BY 2.0
