Training text chunkers on a silver standard corpus: can silver replace gold?

Ning Kang, Erik M Van Mulligen, Jan A Kors

doi:10.1186/1471-2105-13-17

Open Access

Journal: BMC Bioinformatics	Publication Date: Jan 30, 2012
Citations: 17	License type: CC BY 2.0

Affiliation: Erasmus MC

Abstract

Background: To train chunkers to recognize noun phrases and verb phrases in biomedical text, an annotated corpus is required. The creation of gold standard corpora (GSCs), however, is expensive and time-consuming. GSCs therefore tend to be small and to focus on specific subdomains, which limits their usefulness. We investigated the use of a silver standard corpus (SSC) that is automatically generated by combining the outputs of multiple chunking systems. We explored two use scenarios: one in which chunkers are trained on an SSC in a new domain for which a GSC is not available, and one in which chunkers are trained on an available but small GSC supplemented with an SSC.

Results: We tested the two scenarios using three chunkers (Lingpipe, OpenNLP, and Yamcha) and two corpora (GENIA and PennBioIE). For the first scenario, the systems trained for noun-phrase recognition on the SSC in one domain performed 2.7-3.1 percentage points better in terms of F-score than the systems trained on the GSC in another domain, and only 0.2-0.8 percentage points worse than when they were trained on a GSC in the same domain as the SSC. When the outputs of the chunkers were combined, the combined system showed little improvement when using the SSC. For the second scenario, the systems trained on a GSC supplemented with an SSC performed considerably better than systems trained on the GSC alone, especially when the GSC was small. For example, training the chunkers on a GSC consisting of only 10 abstracts but supplemented with an SSC yielded performance similar to training them on a GSC of 100-250 abstracts. The combined system even performed better than any of the individual chunkers trained on a GSC of 500 abstracts.

Conclusions: We conclude that an SSC can be a viable alternative to, or a supplement to, a GSC when training chunkers in a biomedical domain. A combined system only shows improvement if the SSC is used to supplement a GSC. Whether the approach is applicable to other systems in a natural-language processing pipeline remains to be investigated.
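The F-score differences reported above are easier to interpret with a small worked example. The following is a minimal sketch, assuming phrases are scored by exact match of span and label; the abstract does not state the matching criterion, and the function name and example spans are hypothetical.

```python
def phrase_f_score(predicted, reference):
    """F-score over phrases, counting only exact (start, end, label) matches.

    Assumption: exact-match scoring; the abstract does not specify the criterion.
    """
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical noun-phrase spans: (start_token, end_token_exclusive, label)
reference = [(0, 2, "NP"), (3, 6, "NP"), (7, 9, "NP"), (10, 12, "NP")]
predicted = [(0, 2, "NP"), (3, 6, "NP"), (7, 9, "NP"), (10, 11, "NP")]

f_score = phrase_f_score(predicted, reference)
print(f"F-score: {100 * f_score:.1f}%")
```

Here three of the four predicted phrases match the reference exactly, so precision and recall are both 75%. A difference of 0.2-0.8 percentage points between two systems is a difference of that size on this 0-100 scale.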

Highlights

To train chunkers in recognizing noun phrases and verb phrases in biomedical text, an annotated corpus is required
The performance difference between the combined systems based on the GENIA gold standard corpus (GSC) and the PennBioIE silver standard corpus (SSC) is small (only 0.2 percentage points)
To test the consistency of this result, we redid the experiment with interchanged corpora, i.e., GENIA GSC was used for training the chunkers and generating the SSC, and PennBioIE GSC was used for testing

Summary

Introduction

Chunking is a natural language processing technique that splits text into groups of words that constitute a grammatical unit, e.g., a noun phrase or a verb phrase. It is an important processing step in systems that try to automatically extract information from text. To train chunkers to recognize noun phrases and verb phrases in biomedical text, an annotated corpus is required. The creation of a gold standard corpus (GSC) is tedious and expensive: annotation guidelines have to be established, domain experts must be trained, and the annotation process is time-consuming. We investigated the use of a silver standard corpus (SSC) that is automatically generated by combining the outputs of multiple chunking systems. We postulate that the annotations of such a combined system on a given corpus can be taken as a reference standard, establishing a “silver standard corpus”.
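The idea of combining the outputs of multiple chunking systems into a silver standard can be illustrated with a minimal sketch. It assumes a simple voting scheme over exact phrase spans; the summary above does not state the combination rule, so the function name, the `min_votes` threshold, and the example spans are hypothetical.

```python
from collections import Counter


def combine_by_voting(system_outputs, min_votes):
    """Keep each phrase proposed by at least `min_votes` systems.

    A phrase is a (start_token, end_token_exclusive, label) tuple, and two
    systems agree only if span and label match exactly (an assumption).
    """
    votes = Counter()
    for phrases in system_outputs:
        votes.update(set(phrases))  # each system votes at most once per phrase
    return sorted(p for p, n in votes.items() if n >= min_votes)


# Hypothetical outputs of three chunkers for the same sentence
system_a = [(0, 2, "NP"), (2, 3, "VP"), (3, 6, "NP")]
system_b = [(0, 2, "NP"), (2, 3, "VP"), (4, 6, "NP")]
system_c = [(0, 2, "NP"), (2, 4, "VP"), (3, 6, "NP")]

silver = combine_by_voting([system_a, system_b, system_c], min_votes=2)
print(silver)
```

With `min_votes=2`, a phrase enters the silver annotation when at least two of the three systems propose it. Raising the threshold would likely trade recall for precision in the resulting corpus.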

Methods

Results

Discussion

Conclusion