Recognition of Latin scientific names using artificial neural networks.

Damon P Little

doi:10.1002/aps3.11378

Abstract

Premise: The automated recognition of Latin scientific names within vernacular text has many applications, including text mining, search indexing, and automated specimen‐label processing. Most published solutions are computationally inefficient, incapable of running within a web browser, and focus on texts in English, thus omitting a substantial portion of biodiversity literature.

Methods and Results: An open‐source browser‐executable solution, Quaesitor, is presented here. It uses pattern matching (regular expressions) in combination with an ensembled classifier composed of an inclusion dictionary search (Bloom filter), a trio of complementary neural networks that differ in their approach to encoding text, and word length to automatically identify Latin scientific names in the 16 most common languages for biodiversity articles.

Conclusions: In combination, the classifiers can recognize Latin scientific names in isolation or embedded within the languages used for >96% of biodiversity literature titles. For three different data sets, they resulted in a 0.80–0.97 recall and a 0.69–0.84 precision at a rate of 8.6 ms/word.
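
To make the first two components named in the abstract concrete, the following is a minimal sketch of regular-expression candidate matching followed by an inclusion-dictionary lookup in a Bloom filter. It is not Quaesitor's code. The `BloomFilter` class, the `CANDIDATE` pattern, the `find_candidates` function, the filter size, and the salted SHA-256 hashing are all assumptions chosen for illustration, and the neural networks and word-length feature of the ensemble are omitted entirely.

```python
import hashlib
import re


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, rare false positives."""

    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 4) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive n_hashes bit positions by salting a single hash function
        # (an assumption; any family of independent hashes would do).
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical pattern: a capitalized word (genus) followed by a
# lowercase word (epithet). Real names need a far richer pattern.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{2,})\b")


def find_candidates(text: str, known_genera: BloomFilter) -> list[str]:
    """Return regex matches whose first word is probably in the dictionary."""
    return [
        f"{m.group(1)} {m.group(2)}"
        for m in CANDIDATE.finditer(text)
        if m.group(1) in known_genera
    ]


genera = BloomFilter()
for genus in ("Quercus", "Pinus"):
    genera.add(genus)

text = "The oak Quercus alba grows near Pinus strobus in The forest."
names = find_candidates(text, genera)
```

The pattern alone also matches ordinary phrases such as "The oak", which is why a dictionary lookup (and, in the ensemble described above, further classifiers) is needed to filter candidates. A Bloom filter suits a browser setting because it stores a large dictionary in a compact bit array, at the cost of occasional false positives.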

Full Text