Automatic extraction of candidate nomenclature terms using the doublet method

Jules J Berman

doi:10.1186/1472-6947-5-35

Abstract

Background

New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature. The scholarly curator adds new terms as they are encountered. Present-day scholars are severely challenged by the enormous volume of biomedical literature. Curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new, candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and added to nomenclatures, if appropriate. The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words that compose the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term. Candidate new terms can be reviewed by a curator to determine if they should be added to the nomenclature. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and using "The developmental lineage classification and taxonomy of neoplasms" as a reference nomenclature.

Results

A 31+ Megabyte corpus of pathology journal abstracts was parsed using the doublet extraction method. This corpus consisted of 4,289 records, each containing an abstract title. The total number of words included in the abstract titles was 50,547. New candidate terms for the nomenclature were automatically extracted from the titles of abstracts in the corpus. Total execution time on a desktop computer with CPU speed of 2.79 GHz was 2 seconds. The resulting output consisted of 313 new candidate terms, each consisting of concatenated doublets found in the reference nomenclature. Human review of the 313 candidate terms yielded a list of 285 terms approved by a curator. A final automatic extraction of duplicate terms yielded a final list of 222 new terms (71% of the original 313 extracted candidate terms) that could be added to the reference nomenclature.

Conclusion

The doublet method for automatically extracting candidate nomenclature terms can be used to quickly find new terms from vast amounts of text. The method can be immediately adapted for virtually any text and any nomenclature. An implementation of the algorithm, in the Perl programming language, is provided with this article.
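
The doublet matching described in the abstract can be illustrated with a minimal Python sketch. This is not the Perl implementation supplied with the article; the function names (`words`, `build_doublets`, `candidate_terms`), the tokenization rules, and the three-term toy nomenclature are all assumptions made for illustration. The sketch collects every adjacent word pair from the nomenclature, scans text for the longest contiguous runs of known doublets, and keeps a run only if it is not already a whole term.

```python
import re


def words(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_doublets(terms):
    """Collect every adjacent word pair that occurs in any nomenclature term."""
    doublets = set()
    for term in terms:
        w = words(term)
        doublets.update(zip(w, w[1:]))
    return doublets


def candidate_terms(text, terms):
    """Return phrases built only from known doublets that are not existing terms."""
    known = {" ".join(words(t)) for t in terms}
    doublets = build_doublets(terms)
    found = []

    def flush(run):
        # A run holds the words of one contiguous sequence of matching doublets.
        phrase = " ".join(run)
        if len(run) >= 2 and phrase not in known and phrase not in found:
            found.append(phrase)

    # Assumption: runs are not allowed to cross punctuation.
    for fragment in re.split(r"[.,;:!?()]", text):
        w = words(fragment)
        run = []
        for pair in zip(w, w[1:]):
            if pair in doublets:
                if not run:
                    run = [pair[0]]
                run.append(pair[1])
            else:
                flush(run)
                run = []
        flush(run)
    return found


# Hypothetical toy nomenclature and text, not taken from the article.
nomenclature = [
    "squamous cell carcinoma",
    "adenocarcinoma of lung",
    "carcinoma of skin",
]
title = ("A rare squamous cell carcinoma of lung was found. "
         "Squamous cell carcinoma is common.")
print(candidate_terms(title, nomenclature))
# ['squamous cell carcinoma of lung']
```

In this toy run, "squamous cell carcinoma of lung" is reported because each of its doublets appears in some existing term while the phrase itself does not, whereas "squamous cell carcinoma" is skipped because it is already in the nomenclature. As in the article, the output is only a list of candidates for a curator to review.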

Highlights

New terminology continuously enters the biomedical literature
Analysis of doublet occurrences within terms included in the nomenclature: the current version of the neoplasm nomenclature contains 149,192 unique terms.
The phrases are chosen to meet two criteria: 1) they are composed of word doublets that are contained in an existing nomenclature, and 2) the matched phrases do not already occur in the

Summary

Introduction

New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature. The purpose of this paper is to describe a method of rapidly extracting new, candidate terms from huge volumes of biomedical text. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term. The misuse of medical terminology can lead to medical errors, as indicated by the recent ban on certain common medical abbreviations by the U.S. Joint Commission on Accreditation of Healthcare Organizations [2]. This action was taken to reduce the occurrence of medication errors that result when non-standard abbreviations are misinterpreted.

Journal: BMC Medical Informatics and Decision Making	Publication Date: Oct 18, 2005
Citations: 8	License type: cc-by
