Comparative analysis of five protein-protein interaction corpora.

Sampo Pyysalo,Juho Heimonen,Jari Björne,Antti Airola,Tapio Salakoski,Filip Ginter

doi:10.1186/1471-2105-9-s3-s6

Open Access

https://doi.org/10.1186/1471-2105-9-s3-s6

Abstract

Background: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation and consequently resources are largely incompatible and methods are difficult to evaluate.

Results: We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types with no identification of the words specifying the interaction, no negations, and no interaction certainty.

We find that the F-score performance of a state-of-the-art PPI extraction method varies on average 19 percentage units and in some cases over 30 percentage units between the different evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora.

Conclusions: Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization.
In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at .
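
As an illustration only, the unification to a shared level of information and the F-score comparison described in the abstract might be sketched as follows. This is not the authors' conversion software: the `Interaction` record, its field names, and the `unify` and `f_score` helpers are hypothetical, and the sketch assumes that entity mentions are already identified by string IDs. Negation and certainty attributes are not modelled here; under the same assumption they would simply be discarded.

```python
from typing import FrozenSet, Iterable, NamedTuple, Optional, Set


class Interaction(NamedTuple):
    """Hypothetical corpus-specific annotation; field names are illustrative."""
    agent: str                    # ID of the first entity mention
    target: str                   # ID of the second entity mention
    itype: Optional[str] = None   # interaction type, dropped on unification
    static: bool = False          # static relations are excluded


def unify(annotations: Iterable[Interaction]) -> Set[FrozenSet[str]]:
    """Reduce annotations to undirected, untyped binary pairs."""
    pairs = set()
    for ann in annotations:
        if ann.static:
            continue  # keep only non-static interactions
        # A frozenset discards direction: (e1, e2) and (e2, e1) coincide.
        pairs.add(frozenset((ann.agent, ann.target)))
    return pairs


def f_score(gold: Set[FrozenSet[str]], predicted: Set[FrozenSet[str]]) -> float:
    """Balanced F-score (harmonic mean of precision and recall) over pairs."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for this example.
gold_raw = [
    Interaction("e1", "e2", "bind"),
    Interaction("e2", "e1", "phosphorylate"),  # same pair once unified
    Interaction("e3", "e4", static=True),      # dropped as static
    Interaction("e1", "e3"),
]
gold = unify(gold_raw)
predicted = {frozenset(("e2", "e1")), frozenset(("e3", "e4"))}
score = f_score(gold, predicted)
```

Under this reduction two corpora with different annotation schemes can be scored with the same function, and the difference between two such scores, multiplied by 100, gives a difference in percentage units of the kind reported above.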

Summary

Introduction

Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. PPIs are the most widely studied information extraction (IE) target in the BioNLP field, with the key subproblem of protein name recognition being the most commonly addressed task. Recent shared tasks and studies of biomedical named entities have increasingly clarified the concept of a protein name, brought about a rough consensus on how to annotate protein names, and established both the best-performing entity name recognition methods and their performance. For PPI extraction, by contrast, the BioNLP community faces a situation where it is difficult, if not impossible, to reliably identify the best published methods and techniques due to a lack of information on the comparability of their evaluated performance.
