Abstract

BackgroundLarge biomedical data sets have become increasingly important resources for medical researchers. Modern biomedical data sets are annotated with standard terms to describe the data and to support data linking between databases. The largest curated listing of biomedical terms is the the National Library of Medicine's Unified Medical Language System (UMLS). The UMLS contains more than 2 million biomedical terms collected from nearly 100 medical vocabularies. Many of the vocabularies contained in the UMLS carry restrictions on their use, making it impossible to share or distribute UMLS-annotated research data. However, a subset of the UMLS vocabularies, designated Category 0 by UMLS, can be used to annotate and share data sets without violating the UMLS License Agreement.MethodsThe UMLS Category 0 vocabularies can be extracted from the parent UMLS metathesaurus using a Perl script supplied with this article. There are 43 Category 0 vocabularies that can be used freely for research purposes without violating the UMLS License Agreement. Among the Category 0 vocabularies are: MESH (Medical Subject Headings), NCBI (National Center for Bioinformatics) Taxonomy and ICD-9-CM (International Classification of Diseases-9-Clinical Modifiers).ResultsThe extraction file containing all Category 0 terms and concepts is 72,581,138 bytes in length and contains 1,029,161 terms. The UMLS Metathesaurus MRCON file (January, 2003) is 151,048,493 bytes in length and contains 2,146,899 terms. Therefore the Category 0 vocabularies, in aggregate, are about half the size of the UMLS metathesaurus.A large publicly available listing of 567,921 different medical phrases were automatically coded using the full UMLS metatathesaurus and the Category 0 vocabularies. There were 545,321 phrases with one or more matches against UMLS terms while 468,785 phrases had one or more matches against the Category 0 terms. This indicates that when the two vocabularies are evaluated by their fitness to find at least one term for a medical phrase, the Category 0 vocabularies performed 86% as well as the complete UMLS metathesaurus.ConclusionThe Category 0 vocabularies of UMLS constitute a large nomenclature that can be used by biomedical researchers to annotate biomedical data. These annotated data sets can be distributed for research purposes without violating the UMLS License Agreement. These vocabularies may be of particular importance for sharing heterogeneous data from diverse biomedical data sets. The software tools to extract the Category 0 vocabularies are freely available Perl scripts entered into the public domain and distributed with this article.

Highlights

  • Large biomedical data sets have become increasingly important resources for medical researchers

  • The Category 0 vocabularies of Unified Medical Language System (UMLS) constitute a large nomenclature that can be used by biomedical researchers to annotate biomedical data

  • These annotated data sets can be distributed for research purposes without violating the UMLS License Agreement

Read more

Summary

Introduction

Large biomedical data sets have become increasingly important resources for medical researchers. The proposed data sharing policy has arrived at a time when scientists are challenged by confidentiality issues, intellectual property considerations and a variety of federal regulations that constrain the free distribution of research data. Annotation might involve adding information about the translated proteins or the disease that might be associated with a mutation found in the sequence These annotations are a critical part of the sequence data, because the annotations assist in the discovery of the biological relevance of the sequence. Many medical vocabularies used for annotation are encumbered under proprietary license agreements This often means that scientists cannot freely share their annotated biomedical data sets

Objectives
Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call