Abstract

Background

Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. The extensibility of this framework, its ability to uncover system-wide characteristics by analyzing component parts, and its usefulness for facilitating third-party application integration are demonstrated through examples in the biomedical domain.

Results

Our evaluation framework was assembled using the Unstructured Information Management Architecture (UIMA). It was used to analyze a set of gene mention identification systems across 225 combinations of system, evaluation corpus, and correctness measure. Interactions among all three factors were found to affect the relative rankings of the systems. A second experiment evaluated gene normalization system performance using as input 4,097 combinations of gene mention systems and gene mention system-combining strategies. Gene mention system recall is shown to affect gene normalization system performance much more than gene mention system precision does, and high gene normalization performance is shown to be achievable at remarkably low levels of gene mention system precision.

Conclusion

The software presented in this paper demonstrates the potential for novel discovery resulting from the structured evaluation of biomedical language processing systems, as well as the usefulness of such an evaluation framework for promoting collaboration between developers of biomedical language processing technologies. The code base is available as part of the BioNLP UIMA Component Repository on SourceForge.net.
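To make the role of the correctness measure concrete, the sketch below scores one set of predicted gene mention spans against a gold standard under two measures: exact span match and a more lenient overlap match. This is an illustrative reconstruction, not code from the BioNLP UIMA Component Repository; the Span record, method names, and toy data are all hypothetical.

```java
// Minimal sketch (not the framework's actual code): scoring a gene mention
// system under two correctness measures, to show why the choice of measure
// can change reported performance and hence system rankings.
import java.util.List;

public class MentionScorer {

    /** A character-offset span for a gene mention (hypothetical type). */
    record Span(int begin, int end) {
        boolean exactMatch(Span other) {
            return begin == other.begin && end == other.end;
        }
        boolean overlaps(Span other) {
            // Half-open intervals overlap when each starts before the other ends.
            return begin < other.end && other.begin < end;
        }
    }

    /** Count predicted spans that match some gold span under the chosen criterion. */
    static long truePositives(List<Span> predicted, List<Span> gold, boolean exact) {
        return predicted.stream()
                .filter(p -> gold.stream()
                        .anyMatch(g -> exact ? p.exactMatch(g) : p.overlaps(g)))
                .count();
    }

    static void report(String measure, long tp, int nPredicted, int nGold) {
        double precision = nPredicted == 0 ? 0 : (double) tp / nPredicted;
        double recall = nGold == 0 ? 0 : (double) tp / nGold;
        double f = (precision + recall == 0)
                ? 0 : 2 * precision * recall / (precision + recall);
        System.out.printf("%s: P=%.2f R=%.2f F=%.2f%n", measure, precision, recall, f);
    }

    public static void main(String[] args) {
        // Toy data: gold-standard mentions and one system's predictions.
        List<Span> gold = List.of(new Span(0, 4), new Span(10, 15), new Span(30, 38));
        List<Span> predicted = List.of(new Span(0, 4), new Span(11, 15), new Span(40, 45));

        report("exact-span", truePositives(predicted, gold, true), predicted.size(), gold.size());
        report("overlap", truePositives(predicted, gold, false), predicted.size(), gold.size());
    }
}
```

On this toy input the system scores F=0.33 under exact matching but F=0.67 under overlap matching, so two systems with different boundary-detection behavior could swap ranks depending on which measure is chosen.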

Highlights

  • Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice

  • This paper investigates the hypothesis that structured evaluations are a valuable addition to the current paradigm for performance testing of large language-processing systems

  • Structured evaluation has not generally been practiced by the text mining community; we present a novel and surprising discovery about the interaction between gene mention detection and gene normalization (GN) for one GN system, and about that system's high tolerance for low gene mention precision


Introduction

Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. Daelemans points out that evaluations of machine learning algorithms often produce deceptive or incomplete results because they ignore the complex interactions that characterize language and language processing tasks on the one hand, and machine learning algorithms on the other. Some of these interactions involve aspects of the machine learning systems themselves, such as interactions between algorithm parameters and sample selection, or between algorithm parameters and feature selection. Other interactions come from the data: interactions between training set contents and training set size, or between the training set and external knowledge sources. A structured evaluation makes these interactions visible by scoring the full cross-product of factors rather than varying one factor while holding the others fixed, as sketched below.
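A minimal sketch of such a structured evaluation loop follows, corresponding to the 225-run cross-product described in the abstract. The system, corpus, and measure names are placeholders rather than the paper's actual configurations, and the loop body stands in for running and scoring a real UIMA pipeline.

```java
// Minimal sketch of a structured (full-factorial) evaluation: every
// combination of system, corpus, and correctness measure is scored, so
// interactions among the factors become visible in the results grid.
// All names are illustrative placeholders, not the paper's configurations.
import java.util.List;

public class FactorialEvaluation {
    public static void main(String[] args) {
        List<String> systems = List.of("gm-system-1", "gm-system-2", "gm-system-3");
        List<String> corpora = List.of("corpus-A", "corpus-B", "corpus-C");
        List<String> measures = List.of("exact", "overlap", "boundary-relaxed");

        for (String system : systems)
            for (String corpus : corpora)
                for (String measure : measures)
                    // In the real framework, a UIMA pipeline run and a
                    // scoring step over its output would go here.
                    System.out.printf("run: %s on %s scored by %s%n",
                            system, corpus, measure);
    }
}
```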

