Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection.

Yifei Chen, Yuxing Sun, Bing-Qing Han

doi:10.1155/2015/751646

Open Access

https://doi.org/10.1155/2015/751646

Journal: BioMed Research International
Publication Date: Jan 1, 2015
Citations: 23
License type: CC BY 3.0

Affiliation: Nanjing Audit University

Abstract

Protein interaction article classification is a text classification task in the biological domain that determines which articles describe protein-protein interactions. Because the feature space in text classification is high-dimensional, feature selection is widely used to reduce the dimensionality of the features and speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measures of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. We therefore first design a similarity measure between context information that takes the word cooccurrences and phrase chunks around the features into account. We then introduce this context similarity into the importance measure of the features, where it substitutes for document and term frequency, yielding new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.
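
The contrast drawn in the abstract, between scoring a feature by how often it occurs and scoring it by how similar its surrounding contexts are, can be illustrated with a minimal Python sketch. This is not the authors' formulation: the window-based context, the cosine similarity, and the function names (`context_vector`, `cosine`, `document_frequency`, `context_similarity_score`) are all assumptions chosen for illustration, and the sketch ignores phrase chunks entirely.

```python
import math
from collections import Counter


def context_vector(tokens, index, window=2):
    """Bag of words within +/- `window` positions of tokens[index] (assumed context definition)."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return Counter(left + right)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def document_frequency(term, docs):
    """Frequency-based baseline: number of documents containing the term."""
    return sum(1 for doc in docs if term in doc)


def context_similarity_score(term, docs, window=2):
    """Illustrative score: mean pairwise cosine similarity between the
    contexts of every occurrence of `term` in `docs` (an assumption, not
    the measure defined in the paper)."""
    contexts = [
        context_vector(doc, i, window)
        for doc in docs
        for i, token in enumerate(doc)
        if token == term
    ]
    if len(contexts) < 2:
        return 0.0
    pairs = [(i, j) for i in range(len(contexts)) for j in range(i + 1, len(contexts))]
    return sum(cosine(contexts[i], contexts[j]) for i, j in pairs) / len(pairs)


# Two toy collections in which "binds" has the same document frequency
# but very different context agreement.
docs_same = [["a", "binds", "b"], ["a", "binds", "b"]]
docs_diff = [["a", "binds", "b"], ["c", "binds", "d"]]

for name, docs in (("same contexts", docs_same), ("different contexts", docs_diff)):
    print(name, document_frequency("binds", docs), context_similarity_score("binds", docs))
```

In both toy collections the term appears in every document, so a purely frequency-based measure cannot tell them apart, whereas the context-based score separates them. That is the kind of extra signal the abstract attributes to context information.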

Highlights

An overwhelming number of biological articles are published daily online as a result of growing interest in biological research, especially relating to the study of protein-protein interactions (PPIs)
Since the applications are restricted to interaction article classification (IAC), which is a binary classification task, we measure the performance in terms of the F1 measure [20]
Novel context similarity-based feature selection methods were introduced for text classification in the biological domain to classify protein interaction articles
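
The F1 measure referred to in the highlights is the harmonic mean of precision and recall for the positive class. A minimal sketch of the standard definition follows; the function name `f1_score` and the convention of returning 0.0 for undefined cases are assumptions for illustration, not details taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 for a binary task from true positives, false positives, and false negatives.

    Returns 0.0 when precision or recall is undefined (an assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 relevant articles found, 2 false alarms, 2 missed.
print(f1_score(tp=8, fp=2, fn=2))
```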

Summary

Introduction

An overwhelming number of biological articles are published daily online as a result of growing interest in biological research, especially relating to the study of protein-protein interactions (PPIs). It is essential to identify which articles describe PPIs, that is, to filter out irrelevant articles from the whole collection of biological literature. This allows a more efficient extraction of PPIs from the large amount of biological literature. Automated text classification is a key technology for rapidly finding relevant articles. Protein interaction article classification (IAC) is thus a text classification task with practical significance in the biological domain.
