Abstract

Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in this research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme that automatically identifies, within a large pool of biomedical publications, the papers containing information relevant to a specific topic that curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We evaluated our method on a large imbalanced data set that was originally curated manually by the Jackson Laboratory’s Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 are labeled as relevant to GXD while the rest are not. Our results (0.72 precision, 0.80 recall and 0.75 f-measure) demonstrate that the proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.

Highlights

  • The published literature is an important source of biomedical knowledge, as much information is conveyed in the form of publications

  • As such, automated biomedical document classification has attracted much interest [1,2,3,4,5,6]. It is especially needed for the bio-databases curation workflow, as much information is manually curated within such databases [7], e.g. the Mouse Genome Informatics (MGI) database [8]

  • While using Naïve Bayes as the metaclassifier leads to the highest recall and Random Forest attains the highest precision, Support Vector Machines (SVMs) significantly outperform both in terms of f-measure and Matthews correlation coefficient (MCC) (P < 0.001, two-sample t-test), striking a good balance between precision and recall
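
The measures named above (precision, recall, f-measure and MCC) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' evaluation code. The function name `classification_metrics` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not results from the GXD data set.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives. The function name is hypothetical.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # f-measure: harmonic mean of precision and recall
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # Matthews correlation coefficient; defined as 0 when a margin is empty
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced set: 100 relevant, 900 irrelevant
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike the f-measure, MCC also takes the true negatives into account, which is one reason it is often reported alongside the f-measure when the classes are imbalanced.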

Introduction

The published literature is an important source of biomedical knowledge, as much information is conveyed in the form of publications. However, the large and increasing number of publications makes it a challenge to find those relevant to a given topic. One way to address this challenge is through automated document classification, that is, identifying publications relevant to a specific topic within a large collection of articles. As such, automated biomedical document classification has attracted much interest [1,2,3,4,5,6]. It is especially needed for the bio-databases curation workflow, as much information is manually curated within such databases [7], e.g. the Mouse Genome Informatics (MGI) database [8]. The MGI database forms the most extensive international resource for the laboratory mouse. It provides integrated genetic, genomic and biological data for facilitating the study of human health and disease. Publications that report on endogenous gene expression during development and in postnatal stages are included. Excluded from the collection are studies reporting ectopic gene expression via the use of transgenes, experiments studying the effects of treatments or other external/environmental factors, and papers that report only on postnatal gene expression.
