Thesaurus-based disambiguation of gene symbols

Bob Ja Schijvenaars, Jan A Kors, Hester M Wain, Erik M Van Mulligen, Marc Weeber, Barend Mons, Martijn J Schuemie

doi:10.1186/1471-2105-6-149

Open Access

Journal: BMC Bioinformatics
Publication Date: Jan 1, 2005
Citations: 69
License type: CC BY 2.0

Affiliation: Erasmus MC, University College London

Abstract

Background: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck.

Results: We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set.

Conclusion: The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
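The abstract does not spell out how the algorithm scores candidate meanings, so the following is only a toy illustration of the general idea of choosing among thesaurus senses by context. It uses plain word overlap between the text around a symbol and a short description of each sense, which is a generic technique and not necessarily the authors' method. The sense inventory, the symbol `AAA`, and the function names are all hypothetical.

```python
import re

# Hypothetical sense inventory for one ambiguous symbol ("AAA").
# Each sense maps to a few descriptive words; this is made-up toy data,
# not content from the paper's thesaurus.
SENSES = {
    "gene:G1": "kinase signalling phosphorylation receptor",
    "gene:G2": "transporter membrane ion channel",
    "not_a_gene": "questionnaire patients survey score",
}


def tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def disambiguate(context, senses):
    """Return the sense whose description shares the most words with the
    context, or None when no sense shares any word."""
    context_tokens = tokens(context)
    scores = {
        sense: len(context_tokens & tokens(description))
        for sense, description in senses.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Under these assumptions, a sentence about a patient survey would be assigned the `not_a_gene` sense, which mirrors the gene/not a gene decision the abstract highlights.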

Highlights

Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge
We describe our disambiguation approach and assess the performance of the disambiguation algorithm on a large test set of documents
Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set

Summary

Introduction

Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. A number of information retrieval systems have been proposed to extract and relate pertinent biological information from large corpora of text [1,2,3,4,5,6,7,8,9]. These systems even hold promise for the discovery of new, "tacit" knowledge that is hidden in the literature. One approach to dealing with the synonym problem, in which a single gene is referred to by several names, is to make use of the information about genes and their aliases that is available in existing genetic databases.
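As a rough sketch of how alias information from gene databases can expose ambiguity, the snippet below inverts a gene-to-symbols mapping into a symbol index, flags symbols that point to more than one gene or that also appear in a list of non-gene terms, and computes the share of genes carrying at least one ambiguous symbol (the kind of figure the abstract reports as 33%). The gene identifiers, symbols, and non-gene term list are invented for illustration, and symbols are compared as exact strings, which is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical alias records: gene identifier -> set of symbols and aliases.
# Toy data only; none of these entries come from a real database.
GENE_ALIASES = {
    "G1": {"AAA", "FOO1"},
    "G2": {"AAA", "BAR2"},
    "G3": {"CAT", "BAZ3"},
    "G4": {"QUX4"},
}

# Hypothetical non-gene vocabulary (for example, terms from a general thesaurus).
NON_GENE_TERMS = {"CAT"}


def build_symbol_index(gene_aliases):
    """Invert gene -> symbols into symbol -> set of gene identifiers."""
    index = defaultdict(set)
    for gene_id, symbols in gene_aliases.items():
        for symbol in symbols:
            index[symbol].add(gene_id)
    return index


def ambiguous_symbols(index, non_gene_terms=frozenset()):
    """Symbols that denote several genes or that also have a non-gene meaning."""
    return {
        symbol
        for symbol, genes in index.items()
        if len(genes) > 1 or symbol in non_gene_terms
    }


def share_of_genes_affected(gene_aliases, ambiguous):
    """Fraction of genes that have at least one ambiguous symbol."""
    if not gene_aliases:
        return 0.0
    affected = sum(
        1 for symbols in gene_aliases.values() if symbols & ambiguous
    )
    return affected / len(gene_aliases)


index = build_symbol_index(GENE_ALIASES)
ambiguous = ambiguous_symbols(index, NON_GENE_TERMS)
share = share_of_genes_affected(GENE_ALIASES, ambiguous)
```

With this toy data, `AAA` is ambiguous because two genes share it and `CAT` is ambiguous because it also has a non-gene meaning, so three of the four genes are affected.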

