Linguistic feature analysis for protein interaction extraction.

Timur Fayruzov, Chris Cornelis, Veronique Hoste, Martine De Cock

doi:10.1186/1471-2105-10-374

Open Access

https://doi.org/10.1186/1471-2105-10-374

Abstract

Background: The rapid growth of the amount of publicly available reports on biomedical experimental results has recently caused a boost of text mining approaches for protein interaction extraction. Most approaches rely implicitly or explicitly on linguistic, i.e., lexical and syntactic, data extracted from text. However, only a few attempts have been made to evaluate the contribution of the different feature types. In this work, we contribute to this evaluation by studying the relative importance of deep syntactic features, i.e., grammatical relations, shallow syntactic features (part-of-speech information) and lexical features. For this purpose, we use a recently proposed approach that uses support vector machines with structured kernels.

Results: Our results reveal that the contribution of the different feature types varies for the different data sets on which the experiments were conducted. The smaller the training corpus compared to the test data, the more important the role of grammatical relations becomes. Moreover, classifiers based on deep syntactic information prove to be more robust on heterogeneous texts where no or only limited common vocabulary is shared.

Conclusion: Our findings suggest that grammatical relations play an important role in the interaction extraction task. Moreover, the net advantage of adding lexical and shallow syntactic features is small relative to the number of added features. This implies that efficient classifiers can be built by using only a small fraction of the features that are typically being used in recent approaches.

Highlights

The rapid growth of the amount of publicly available reports on biomedical experimental results has recently caused a boost of text mining approaches for protein interaction extraction
Our findings suggest that grammatical relations play an important role in the interaction extraction task
We study the impact of different feature types on the performance of a relation extraction system that uses a support vector machine (SVM) classifier with kernels as its core, since at present this is the most popular choice in the relation extraction field

Summary

Introduction

The rapid growth of the amount of publicly available reports on biomedical experimental results has recently caused a boost of text mining approaches for protein interaction extraction. We contribute to the evaluation of the different feature types by studying the relative importance of deep syntactic features, i.e., grammatical relations, shallow syntactic features (part-of-speech information) and lexical features. For this purpose, we use a recently proposed approach that uses support vector machines with structured kernels. An overwhelming number of experimental studies on gene and protein interactions is being conducted. The results of these experiments are most often described as scientific reports or articles and published in public knowledge repositories, such as Medline (http://www.ncbi.nlm.nih.gov/). To achieve state-of-the-art performance, researchers employ lexical information (words) along with shallow syntactic information (POS) and/or deep syntactic features (grammatical structures) (see for example [1,2,3,4,5,6,7,8,9,10]).
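To make the three feature types concrete, the following minimal Python sketch extracts lexical, part-of-speech and grammatical-relation features from two toy sentences and compares them with a simple bag-of-features dot product. This is an illustration under stated assumptions, not the structured kernels used in the paper: the sentences, protein names, POS tags and relation labels are hypothetical and hand-annotated, and the function names are introduced here only for the example.

```python
from collections import Counter

# Hypothetical, hand-annotated toy sentences: (token, POS tag, head index, relation).
# Tags and relation labels are illustrative assumptions, not real parser output.
SENT_A = [
    ("ProtA", "NN", 1, "nsubj"),
    ("binds", "VBZ", -1, "root"),
    ("ProtB", "NN", 1, "dobj"),
]
SENT_B = [
    ("ProtC", "NN", 1, "nsubj"),
    ("inhibits", "VBZ", -1, "root"),
    ("ProtD", "NN", 1, "dobj"),
]


def lexical_features(sent):
    """Lexical view: counts of lower-cased word forms."""
    return Counter(tok.lower() for tok, _, _, _ in sent)


def pos_features(sent):
    """Shallow syntactic view: counts of part-of-speech tags."""
    return Counter(pos for _, pos, _, _ in sent)


def relation_features(sent):
    """Deep syntactic view: counts of grammatical relations (root excluded)."""
    return Counter(rel for _, _, head, rel in sent if head >= 0)


def linear_kernel(a, b):
    """Dot product of two sparse count vectors (a simple stand-in kernel)."""
    return sum(a[k] * b[k] for k in a.keys() & b.keys())


if __name__ == "__main__":
    for name, extract in [
        ("lexical", lexical_features),
        ("pos", pos_features),
        ("relations", relation_features),
    ]:
        print(name, linear_kernel(extract(SENT_A), extract(SENT_B)))
```

The two sentences share no words, so the lexical similarity is zero, while the POS and relation views still register their common structure. This mirrors, in a very simplified way, the observation that syntactic features can remain informative when texts share little vocabulary.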

Methods

Results

Discussion

Conclusion