Abstract

Background: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact on the number of terms identified by the different rules on a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated for every rule.

Results: Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms and seven out of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, which is an increase of 2.8% in the number of terms and an increase of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size. 7,397 terms were suppressed in the corpus.

Conclusions: We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE.
A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper.

Highlights

  • Identification of terms is essential for biomedical text mining

  • The suppression rules were implemented to rid the Unified Medical Language System (UMLS) of terms that are undesired for term identification, either because they reduce the precision of the term identification (e.g. the synonym “2” for the term “clinical class”, or the synonym “EC 2.7.1.-” for the concept “human CDC7 protein”), or because they reduce its efficiency, i.e. long and vague terms that are unlikely to be found in text, such as “poisoning by other and unspecified drugs and medicinal substances”, or terms that are useless for concept identification, such as the concept with the single term “WHILE”

  • Our work complements the work by McCray et al. and by Rogers and Aronson in two respects: we measured the impact of the different rules on the number of terms identified in all of MEDLINE (1965-2007), whereas the others only reported the number of strings in the UMLS that were affected by the specific string properties, and we performed a manual analysis of the rewritten terms retrieved from the corpus and of the terms that were suppressed in the corpus
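The kinds of undesired terms named in the highlights above can be illustrated with a minimal Python sketch. It is not the authors' rule set: the function name `suppression_reason`, the regular expression, the word-count cut-off `MAX_WORDS`, and the list of vague words are all assumptions chosen only to reproduce the quoted examples.

```python
import re
from typing import Optional

# Hypothetical pattern for EC-number-like strings such as "EC 2.7.1.-".
EC_NUMBER = re.compile(r"^EC \d+(\.(\d+|-)){1,3}$")

# Assumed cut-off and marker words for "long and vague" terms.
MAX_WORDS = 6
VAGUE_WORDS = {"other", "unspecified"}


def suppression_reason(term: str) -> Optional[str]:
    """Return a label if the term looks undesired, otherwise None."""
    stripped = term.strip()
    # Very short or purely numeric strings match far too much text.
    if len(stripped) < 2 or stripped.isdigit():
        return "short_or_numeric"
    # Enzyme Commission style identifiers rather than names.
    if EC_NUMBER.match(stripped):
        return "ec_number"
    # Long, vague descriptions are unlikely to occur verbatim in text.
    words = stripped.lower().split()
    if len(words) > MAX_WORDS and VAGUE_WORDS.intersection(words):
        return "long_and_vague"
    return None


if __name__ == "__main__":
    for example in [
        "2",
        "EC 2.7.1.-",
        "poisoning by other and unspecified drugs and medicinal substances",
        "human CDC7 protein",
    ]:
        print(repr(example), "->", suppression_reason(example))
```

A real implementation would apply such checks to every term string in the vocabulary and keep the label, so that the effect of each rule can be evaluated separately.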


Summary

Introduction

Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). Approaches to term identification generally fall into three categories: lexicon-based systems, rule-based systems, and statistics-based systems making use of different machine learning techniques [15]. The lexicon-based approach deals with general medical terms, for which it is difficult to design the general matching patterns used by rule-based systems. It provides information concerning the semantic relations between terms and supports synonym and referent data source mapping, which is not possible using rule-based or statistics-based term identification. NLM checks terms from different vocabularies for synonymy, assigns a unique concept identifier (CUI), and assigns concepts to one or more semantic types from the UMLS Semantic Network.
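As a rough illustration of lexicon-based term identification, the following Python sketch maps synonyms to a shared concept identifier and finds the longest matching term at each position in a text. The concept identifiers and the tiny vocabulary are invented toy data, not real UMLS content, and the helper names `build_lexicon` and `identify_terms` are hypothetical.

```python
import re

# Toy vocabulary: made-up CUIs, each with semantic types and synonymous terms.
TOY_CONCEPTS = {
    "C0000001": {
        "semantic_types": ["Disease or Syndrome"],
        "terms": ["myocardial infarction", "heart attack"],
    },
    "C0000002": {
        "semantic_types": ["Pharmacologic Substance"],
        "terms": ["aspirin", "acetylsalicylic acid"],
    },
}


def build_lexicon(concepts):
    """Map each lower-cased term (as a token tuple) to the CUIs that list it."""
    lexicon = {}
    for cui, entry in concepts.items():
        for term in entry["terms"]:
            key = tuple(term.lower().split())
            lexicon.setdefault(key, set()).add(cui)
    return lexicon


def identify_terms(text, lexicon):
    """Greedy longest-match lookup of lexicon terms in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    longest = max((len(key) for key in lexicon), default=0)
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in lexicon:
                hits.append((" ".join(key), sorted(lexicon[key])))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return hits


if __name__ == "__main__":
    lexicon = build_lexicon(TOY_CONCEPTS)
    print(identify_terms("Aspirin after a heart attack", lexicon))
```

Because both "heart attack" and "myocardial infarction" resolve to the same toy identifier, the lookup returns a concept rather than a surface string, which is the synonym mapping that the lexicon-based approach is described as supporting.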

