Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining

Kristina M Hettne, Erik M Van Mulligen, Antony J Williams, Valery Tkachenko, Jos Kleinjans, Jan A Kors

doi:10.1186/1758-2946-2-3

Open Access

Journal: Journal of Cheminformatics
Publication Date: Mar 23, 2010
Citations: 60
License type: CC BY 2.0

Affiliation: Maastricht University, Erasmus MC

Abstract

Background

Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.

Results

We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary, at about 80 k names, was only one-third to one-quarter the size of Chemlist, at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation.

Conclusions

We conclude the following: (1) The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation are necessary to achieve a high precision for both the automatically generated and the manually curated dictionary.
ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
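
The abstract reports precision and recall but not the F-score values behind conclusion (1). As a rough check, the sketch below recomputes them from the reported post-filtering figures, assuming the F-score meant here is the balanced F1 (the harmonic mean of precision and recall); that choice, and the function name `f_score`, are assumptions rather than something stated in the abstract.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall after filtering and disambiguation, as reported in the abstract.
reported = {
    "ChemSpider": (0.87, 0.19),
    "Chemlist": (0.67, 0.40),
}

scores = {name: f_score(p, r) for name, (p, r) in reported.items()}

for name, f in scores.items():
    print(f"{name}: F = {f:.2f}")
# ChemSpider: F = 0.31
# Chemlist: F = 0.50
```

Under this assumption Chemlist comes out ahead on F-score despite its lower precision, which is consistent with conclusion (1).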

Highlights

We developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus
Before pre-processing, the ChemSpider dictionary contained 157,173 terms belonging to 84,065 entities and after pre-processing 160,898 terms belonging to 84,059 entities
Dictionary term strings that matched the start and end positions of the chemical term strings in the corpus constituted true positives (TP), term strings that were not marked as chemical term strings in the corpus but still matched a dictionary term string were false positives (FP), and chemical term strings in the corpus that were not matched were false negatives (FN)
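
The TP/FP/FN definitions in the last highlight can be illustrated with a minimal sketch. It assumes annotations and dictionary matches are both represented as (start, end) character offsets and that only exact boundary matches count as true positives; the `Span` type, the `evaluate` function, and the toy offsets are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Character offsets of a term in the text (hypothetical representation)."""
    start: int
    end: int


def evaluate(gold: set[Span], predicted: set[Span]) -> dict[str, float]:
    """Count exact-boundary matches and derive precision and recall."""
    tp = len(gold & predicted)   # dictionary matches with identical start and end
    fp = len(predicted - gold)   # dictionary matches not annotated in the corpus
    fn = len(gold - predicted)   # annotated terms the dictionary did not match
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy data: one exact match, one partial overlap, one spurious match.
gold = {Span(0, 7), Span(20, 29), Span(40, 47)}
predicted = {Span(0, 7), Span(20, 25), Span(60, 64)}

result = evaluate(gold, predicted)
print(result)
```

Note that under exact matching a partial overlap such as `Span(20, 25)` against `Span(20, 29)` is counted twice, once as a false positive and once as a false negative; whether the original evaluation treated partial matches this way is an assumption of this sketch.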

Summary

Introduction

We developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable precision (0.67) and recall (0.40) we used a number of automatic and semi-automatic processing steps together with disambiguation rules. It remained to be investigated what impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. We expect that a higher precision can be reached with a manually curated dictionary.

Objectives

Results

Discussion

Conclusion