Abstract

Background

Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE®.

Results

We propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The proposed approach uses simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier filters out noisy phrases. For the experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords “gene”, “protein”, “disease”, “cell” and “cells”.

Conclusions

Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation; however, a simple and automated solution with high precision provides a convenient way of enriching semantic categories by incorporating terms obtained from the literature.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0487-2) contains supplementary material, which is available to authorized users.

Highlights

  • Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks

  • Due to the rapid growth of biomedical literature, machine learning and natural language processing (NLP) techniques have gained in popularity for (semi-)automatically extracting useful information [1]

  • Dataset: The proposed method requires a training set for the support vector machine (SVM) classifier

Summary

Introduction

Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks. Due to the rapid growth of biomedical literature, machine learning and NLP techniques have gained in popularity for (semi-)automatically extracting useful information [1]. Approaches for term identification fall into three categories [1,4,7]: dictionary-based, rule-based and statistical. Dictionary-based approaches use existing terminological resources to identify term occurrences in text [4]. Rule-based approaches find terms by building rules that describe naming structures for a certain concept [11,12,13]. Although these methods accurately identify known patterns, manual rule construction is costly and time-consuming. In statistical approaches, it is challenging to choose a set of discriminating features.
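
To make the distinction between dictionary-based and rule-based identification concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names (`dictionary_match`, `select_candidates`, `headword`), the example phrases, and the rule "the last token of the phrase is the headword" are illustrative assumptions. Only the headword list is taken from the abstract.

```python
import re

# Headwords named in the abstract; everything else here is illustrative.
HEADWORDS = {"gene", "protein", "disease", "cell", "cells"}


def dictionary_match(text, vocabulary):
    """Dictionary-based: return the vocabulary terms that occur in the text.

    Matching is case-insensitive and restricted to word boundaries.
    """
    lowered = text.lower()
    return {
        term
        for term in vocabulary
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    }


def headword(phrase):
    """Assumed rule: treat the last token of a noun phrase as its headword."""
    tokens = re.findall(r"[A-Za-z0-9-]+", phrase.lower())
    return tokens[-1] if tokens else None


def select_candidates(phrases, headwords=HEADWORDS):
    """Rule-based: keep the noun phrases whose headword is in the target set."""
    return [p for p in phrases if headword(p) in headwords]


if __name__ == "__main__":
    # Hypothetical vocabulary and phrases, for demonstration only.
    vocab = {"breast cancer", "BRCA1 gene", "lung cancer"}
    sentence = "Mutations in the BRCA1 gene raise breast cancer risk."
    print(dictionary_match(sentence, vocab))

    phrases = ["tumor suppressor gene", "gene expression",
               "T cells", "heat shock protein"]
    print(select_candidates(phrases))
```

In this sketch, "gene expression" is rejected because its final token is "expression", while "tumor suppressor gene" is kept. A real pipeline would take its noun phrases from a chunker or parser and, as the abstract describes, pass the surviving candidates to a classifier that filters out noisy phrases.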
