Taxamatch, an algorithm for near ('fuzzy') matching of scientific names in taxonomic databases.

Tony Rees, Diego Fontaneto

doi:10.1371/journal.pone.0107510

Abstract

Misspellings of organism scientific names create barriers to optimal storage and organization of biological data, reconciliation of data stored under different spelling variants of the same name, and appropriate responses from user queries to taxonomic data systems. This study presents an analysis of the nature of the problem from first principles, reviews some available algorithmic approaches, and describes Taxamatch, an improved name matching solution for this information domain. Taxamatch employs a custom Modified Damerau-Levenshtein Distance algorithm in tandem with a phonetic algorithm, together with a rule-based approach incorporating a suite of heuristic filters, to produce improved levels of recall, precision and execution time over the existing dynamic programming algorithms n-grams (as bigrams and trigrams) and standard edit distance. Although entirely phonetic methods are faster than Taxamatch, they are inferior in the area of recall since many real-world errors are non-phonetic in nature. Excellent performance of Taxamatch (as recall, precision and execution time) is demonstrated against a reference database of over 465,000 genus names and 1.6 million species names, as well as against a range of error types as present at both genus and species levels in three sets of sample data for species and four for genera alone. An ancillary authority matching component is included which can be used both for misspelled names and for otherwise matching names where the associated cited authorities are not identical.

Highlights

Scientific names of organisms, together with the higher taxonomic groups within which they are nested, represent the key identifiers by which the bulk of the world’s biodiversity information is organized and stored [1], yet in many cases they may be unfamiliar and non-intuitive to spell, for example Syzygotettix boettcheri, a grasshopper; Cirrhitichthys oxyrhynchos, a fish.
The reference database of notionally correctly spelled names used in the present study was the author’s IRMNG database cited above, which at the time of testing in May 2013 contained 465,433 genus names and 1,674,319 separate species. A small number of target names flagged as misspellings, nomina nuda and later usages were masked during the test operation so as to avoid generating misleading results, such as would otherwise arise where an input misspelling matches a known misspelled name held in the database, or where a stored misspelling masks a true hit during the result shaping stage.
This study demonstrates that a hybrid approach incorporating both a Modified Damerau-Levenshtein Distance algorithm and a phonetic algorithm customized to the characteristics of taxonomic names can detect close to 100% of errors in taxon scientific names, of multiple error types, and that good levels of both precision and efficiency can be obtained via incorporation of appropriate rule-based filters at relevant points in the algorithm design.
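To make the edit-distance component concrete, the following is a minimal Python sketch of the standard "optimal string alignment" form of Damerau-Levenshtein distance, combined with a simple length pre-filter. It is not the Modified Damerau-Levenshtein Distance used in Taxamatch, whose modifications are not described in this section. The function names `osa_distance` and `near_matches`, the `max_distance` threshold of 2 and the length filter are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters each cost 1.
    This is the textbook variant, not the Taxamatch modification."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ch" <-> "hc"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def near_matches(query: str, reference: Iterable[str],
                 max_distance: int = 2) -> List[Tuple[str, int]]:
    """Return (name, distance) pairs within max_distance of the query.
    The threshold and the length pre-filter are illustrative assumptions."""
    q = query.lower()
    hits = []
    for name in reference:
        # Cheap rule-based filter: a large length difference cannot
        # yield a distance within the threshold, so skip the full test.
        if abs(len(name) - len(q)) > max_distance:
            continue
        dist = osa_distance(q, name.lower())
        if dist <= max_distance:
            hits.append((name, dist))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```

With the genus names quoted above as a toy reference list, `near_matches("Syzygotetix", ["Syzygotettix", "Cirrhitichthys"])` would return only `Syzygotettix`, at distance 1. A filter of this kind illustrates how rule-based pre-screening can avoid running the full distance computation against every name in a large reference database.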

Summary

Introduction

The problem domain

Scientific names of organisms, together with the higher taxonomic groups within which they are nested, represent the key identifiers by which the bulk of the world’s biodiversity information is organized and stored [1], yet in many cases they may be unfamiliar and non-intuitive to spell, for example Syzygotettix boettcheri, a grasshopper; Cirrhitichthys oxyrhynchos, a fish. The present work examines the performance of selected algorithms of the above types from first principles and in practice using a range of real world misspellings of scientific names of organisms drawn from a number of sources, tested against a reference database containing over 465,000 correctly spelled genus names and 1.67 million species names, and describes the Taxamatch algorithm which is a composite approach designed with the aim of providing optimal performance for near matching of taxon scientific names. An authority comparison module is presented which computes numeric similarities between authorities in the case that these are available for input and target names, which can either be used within Taxamatch to assist in the discrimination of likely true from false matches, or as a standalone test for measuring authority similarity in other situations.
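As an illustration of what a numeric similarity between two authority strings might look like, the sketch below normalizes each authority and scores the pair with a bigram Dice coefficient, one of the n-gram approaches mentioned in the abstract. It is an assumption-laden example and not the Taxamatch authority comparison module, whose rules are not given in this section. The normalization steps, the function names and the sample authorities are all hypothetical.

```python
import re
from collections import Counter
from typing import List


def normalize_authority(authority: str) -> str:
    """Lowercase, then drop digits (years), punctuation and extra spaces.
    These normalization rules are illustrative assumptions."""
    text = authority.lower()
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    return " ".join(text.split())


def bigrams(text: str) -> List[str]:
    """All overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def authority_similarity(a: str, b: str) -> float:
    """Bigram Dice coefficient between normalized authorities, in [0, 1]."""
    na, nb = normalize_authority(a), normalize_authority(b)
    if not na or not nb:
        return 0.0  # nothing to compare
    if na == nb:
        return 1.0
    ga, gb = Counter(bigrams(na)), Counter(bigrams(nb))
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0  # both strings are single characters that differ
    overlap = sum((ga & gb).values())  # multiset intersection
    return 2.0 * overlap / total
```

Under these assumptions, `authority_similarity("(Linnaeus, 1758)", "Linnaeus 1758")` scores the two citations as identical, because only brackets and punctuation differ, while a spelling variant such as "Linneus" scores below 1 but well above an unrelated author name. A score of this kind could serve as a secondary signal for separating likely true from false name matches, in the way the paragraph above describes.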

Methods

Results

Discussion

Conclusion