Mapping biological entities using the longest approximately common prefix method.

Alex Rudniy, James Geller, Min Song

doi:10.1186/1471-2105-15-187

Abstract

Background: The significant growth in the volume of electronic biomedical data in recent decades has pointed to the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, duplicate detection, terminology integration, and spelling correction. The task of source integration in the Unified Medical Language System (UMLS) requires considerable expert effort despite the presence of various computational tools. This problem warrants the search for a new method for approximate string matching and its UMLS-based evaluation.

Results: This paper introduces the Longest Approximately Common Prefix (LACP) method as an algorithm for approximate string matching that runs in linear time. We compare the LACP method for performance, precision and speed to nine other well-known string matching algorithms. As test data, we use two multiple-source samples from the Unified Medical Language System (UMLS) and two SNOMED Clinical Terms-based samples. In addition, we present a spell checker based on the LACP method.

Conclusions: The Longest Approximately Common Prefix method completes its string similarity evaluations in less time than all nine string similarity methods used for comparison. The Longest Approximately Common Prefix outperforms these nine approximate string matching methods in its Maximum F1 measure when evaluated on three out of the four datasets, and in its average precision on two of the four datasets.

Highlights

The significant growth in the volume of electronic biomedical data in recent decades has pointed to the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, duplicate detection, terminology integration, and spelling correction
We introduce the Longest Approximately Common Prefix (LACP) method for Approximate String Matching (ASM) and present the results of its use to improve the operation of a number of applications in biomedical informatics and related domains
This paper demonstrates how this fast string distance method provides performance that is superior to other methods on datasets from SNOMED Clinical Terms (SNOMED CT) and from multiple Unified Medical Language System (UMLS) sources (Table 1) in terms of average precision and Maximum F1
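
The two evaluation measures named above, average precision and Maximum F1, can be illustrated with a minimal sketch. It assumes the common ranking-based definitions: candidate string pairs are sorted by similarity score, precision is averaged over the ranks at which true matches occur, and Maximum F1 is the highest F1 reached at any cutoff in that ranking. The function name `rank_metrics` and the input layout are hypothetical and are not taken from the paper; the exact evaluation protocol used there may differ.

```python
from typing import List, Tuple


def rank_metrics(scored: List[Tuple[float, bool]]) -> Tuple[float, float]:
    """Return (average_precision, maximum_f1) for scored candidate pairs.

    Each item is (similarity_score, is_true_match). This layout is an
    assumption made for illustration, not the paper's data format.
    """
    # Rank pairs from most to least similar; ties keep their input order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_matches = sum(1 for _, is_match in ranked if is_match)
    if total_matches == 0:
        return 0.0, 0.0

    hits = 0
    precision_sum = 0.0
    max_f1 = 0.0
    for rank, (_, is_match) in enumerate(ranked, start=1):
        if is_match:
            hits += 1
            # Precision at this rank contributes to average precision.
            precision_sum += hits / rank
        if hits:
            # Treat every rank as a possible cutoff and track the best F1.
            precision = hits / rank
            recall = hits / total_matches
            max_f1 = max(max_f1, 2 * precision * recall / (precision + recall))

    return precision_sum / total_matches, max_f1
```

For example, `rank_metrics([(0.9, True), (0.8, False), (0.7, True), (0.1, False)])` ranks two true matches at positions 1 and 3, so average precision is (1/1 + 2/3) / 2 and the best F1 is reached at the cutoff after rank 3.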

Summary

Introduction

The significant growth in the volume of electronic biomedical data in recent decades has pointed to the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, duplicate detection, terminology integration, and spelling correction. The task of source integration in the Unified Medical Language System (UMLS) requires considerable expert effort despite the presence of various computational tools. This problem warrants the search for a new method for approximate string matching and its UMLS-based evaluation. The recent expansion of healthcare information systems that draw from multiple medical databases has resulted in redundant information, among other problems. This phenomenon, known as the duplicate detection problem, has caused problems with record. As additional sources are integrated into the UMLS, they will require reintegration with existing vocabularies [4].
