Feature engineering for MEDLINE citation categorization with MeSH

Antonio Jose Jimeno Yepes, Laura Plaza, Jorge Carrillo-De-Albornoz, Alan R Aronson, James G Mork

doi:10.1186/s12859-015-0539-7

Abstract

Background: Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations.

Results: Traditional features like unigrams and bigrams exhibit strong performance compared to other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus have lower performance compared to more traditional features. Conceptual annotation of the texts by MetaMap shows similar performance compared to unigrams, but adding concepts from the UMLS taxonomy does not improve the performance of using only mapped concepts. The combination of all the features performs largely better than any individual feature set considered. In addition, this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those that are more resilient to class imbalance largely obtain better performance.

Conclusions: We conclude that even though traditional features such as unigrams and bigrams have strong performance compared to other features, it is possible to combine them to effectively improve the performance of the bag-of-words representation. We have also found that the combination of the learning algorithm and feature sets has an influence in the overall performance of the system. Moreover, using learning algorithms resilient to class imbalance largely improves performance. However, when using a large set of features, consideration needs to be taken with algorithms due to the risk of over-fitting. Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0539-7) contains supplementary material, which is available to authorized users.
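To make the feature sets named in the abstract more concrete, the following minimal sketch shows one way unigram and bigram counts could be extracted from a citation and merged into a single feature vector. It is an illustration only, not the authors' pipeline: the tokenizer, the `uni=`/`bi=` prefixes, the function names, and the example sentence are all assumptions introduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_features(tokens):
    """Count single tokens, prefixed so they cannot collide with bigrams."""
    return Counter(f"uni={t}" for t in tokens)


def bigram_features(tokens):
    """Count adjacent token pairs."""
    return Counter(f"bi={a}_{b}" for a, b in zip(tokens, tokens[1:]))


def combined_features(text):
    """Merge both feature sets into one sparse count vector."""
    tokens = tokenize(text)
    features = Counter()
    features.update(unigram_features(tokens))
    features.update(bigram_features(tokens))
    return features


# Hypothetical citation title, used only for illustration.
example = "Heart failure in elderly patients with heart disease"
features = combined_features(example)
print(features["uni=heart"])         # count of the unigram "heart"
print(features["bi=heart_failure"])  # count of the bigram "heart failure"
```

Prefixing each feature with its source keeps the feature sets separable, so further sets (noun phrases, meta-data, concept annotations) could be merged into the same vector in the same way.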

Highlights

Research in biomedical text categorization has mostly used the bag-of-words representation
Overall we can see that AdaBoostM1 with oversampling and SVM optimized for multi-variate measures perform much better
We could try improving the performance of unigrams by combining them with other feature sets (MTI performance has been shown to improve by combining several sources of information)
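The oversampling mentioned in the highlights can be sketched as random duplication of minority-class examples until both classes are the same size. The exact oversampling procedure used with AdaBoostM1 is not described in this text, so the function below, its name, and its parameters are assumptions made for illustration.

```python
import random


def oversample_minority(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until classes are balanced.

    Assumes binary labels 0/1 (e.g. a single MeSH heading absent/present).
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for x, y in zip(examples, labels):
        by_label[y].append(x)

    minority_label = min(by_label, key=lambda y: len(by_label[y]))
    majority_label = 1 - minority_label
    minority = by_label[minority_label]
    deficit = len(by_label[majority_label]) - len(minority)

    # Nothing to do if already balanced or if one class is missing.
    if not minority or deficit <= 0:
        return list(examples), list(labels)

    extra = [rng.choice(minority) for _ in range(deficit)]
    return list(examples) + extra, list(labels) + [minority_label] * deficit


# Hypothetical toy data: two positive citations, four negative ones.
docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
labels = [1, 0, 0, 0, 0, 1]
balanced_docs, balanced_labels = oversample_minority(docs, labels)
print(len(balanced_docs), balanced_labels.count(1), balanced_labels.count(0))
```

Oversampling should be applied to the training split only, so that duplicated citations do not leak into the evaluation data.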

Summary

Introduction

Research in biomedical text categorization has mostly used the bag-of-words representation. Each MEDLINE citation is manually assigned a number of relevant medical subject headings that classify the document according to its topic. As stated in [2], MEDLINE indexing is the responsibility of a relatively small group of highly qualified indexing contractors and staff at the NLM who find it difficult to maintain the quality of this huge resource. In this situation, automatic methods to categorize citations might be relevant.

Journal: BMC Bioinformatics
Publication Date: Apr 8, 2015
Reference: Jimeno Yepes et al. BMC Bioinformatics (2015) 16:113
Citations: 58
License type: CC BY 4.0
