Abstract

Motivation: As research on literature mining in the biomedical domain grows, a growing number of systems are available to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task: detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions: i) how well a system performs in detecting LFs from novel text, ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and iii) how to combine results from various SF knowledge bases.

Method: We evaluated the following three publicly available systems in detecting LFs for SFs: i) ALICE, a handcrafted pattern- and rule-based system by Ao and Takagi, ii) a machine learning system by Chang et al., and iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i) the UMLS (the Unified Medical Language System), and ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases.

Results: We found that the detection systems agree with each other in most cases, and that the existing terminological knowledge bases have good coverage of synonymy relationships for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases.

Availability: The web site is .

Highlights

  • Much of the new knowledge relevant to biomedical research is recorded as free text in the form of journal articles or annotation fields of databases

  • We found that the detection systems agree with each other in most cases, and that the existing terminological knowledge bases have good coverage of synonymy relationships for frequently defined long forms (LFs)

  • How can we combine the results from various systems and short form (SF) knowledge bases? To answer such questions, we evaluated several publicly accessible LF detection systems on a corpus of MEDLINE abstracts published between January 2006 and May 2006


Introduction

Much of the new knowledge relevant to biomedical research is recorded as free text in the form of journal articles or annotation fields of databases. Because of the complexity of the biomedical domain, biomedical terms are often lengthy. They usually contain words that imply their corresponding semantic types, e.g., virus in Epstein-Barr virus or protein in latent membrane protein, or words that describe properties of the referred entities, such as latent in latent membrane protein. For biomedical concepts such as genes or proteins, it may be difficult to come up with terms that are both short and descriptive. Concise representations such as acronyms, abbreviations, and symbols have therefore been used in text for biomedical concepts that either occur frequently or are difficult to describe. A system can recognize that Epstein-Barr virus denotes a kind of virus, but it would be difficult to infer the semantic type virus from its acronym, EBV. Note that some of the symbols may never be defined in text [1,12].
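To make the detection task concrete, the following is a minimal Python sketch of a right-to-left character alignment between a parenthesized short form and the words that precede it, in the spirit of the simple alignment-based approach mentioned in the abstract. It is an illustration under stated assumptions, not the code evaluated in this study: the function names `find_best_long_form` and `extract_pairs`, the candidate window size, and the short form length filter are all choices made here for the example.

```python
import re


def find_best_long_form(short_form, candidate):
    """Align short_form against candidate text from right to left.

    Returns the shortest suffix of `candidate` whose characters cover the
    short form in order, or None if no alignment exists. The first
    character of the short form must match the start of a word.
    """
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            continue  # skip punctuation inside the short form
        # Move left until the character matches; for the first short form
        # character, also require that the match starts a word.
        while (l_index >= 0 and candidate[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Extend back to the beginning of the word containing the last match.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Return (short form, long form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        # Assumed filter: keep only short, alphanumeric-initial candidates.
        if not 2 <= len(short_form) <= 10 or not short_form[0].isalnum():
            continue
        words = sentence[: match.start()].split()
        # Assumed window: a bounded number of words before the parenthesis.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


if __name__ == "__main__":
    print(extract_pairs("Systems can detect Epstein-Barr virus (EBV) in text."))
```

The sketch only handles the pattern in which the long form is followed by the short form in parentheses. As the introduction notes, some symbols are never defined in text, and such cases are out of reach for any definition detection approach of this kind.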

