NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition

Rezarta Islamaj, Chih-Hsuan Wei, David Cissel, Nicholas Miliaras, Olga Printseva, Oleg Rodionov, Keiko Sekiya, Janice Ward, Zhiyong Lu

doi:10.1016/j.jbi.2021.103779


Open Access

https://doi.org/10.1016/j.jbi.2021.103779


Abstract


The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. While current methods for tagging gene entities have been developed for biomedical literature, their performance on species other than human is substantially lower due to the lack of annotation data. We therefore present the NLM-Gene corpus, a high-quality manually annotated corpus for genes developed at the US National Library of Medicine (NLM). Compared to previous gene annotation corpora, it covers ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per document, and offers a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.). NLM-Gene consists of 550 PubMed abstracts from 156 biomedical journals, doubly annotated by six experienced NLM indexers who were randomly paired for each document to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. This gold-standard corpus can serve as a benchmark to develop and test new gene text-mining algorithms. Using this new resource, we have developed a new gene-finding algorithm based on deep learning that improves on both the precision and recall of existing tools. The NLM-Gene annotated corpus is freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/lu/NLMGene. We have also applied this tool to the entire PubMed/PMC, and the results are freely accessible through our web-based tool PubTator (www.ncbi.nlm.nih.gov/research/pubtator).


Journal: Journal of Biomedical Informatics	Publication Date: Apr 9, 2021
Citations: 19	License type: public-domain
