Improving the performance of dictionary-based approaches in protein name recognition

Yoshimasa Tsuruoka, Jun’Ichi Tsujii

doi:10.1016/j.jbi.2004.08.003

Abstract

Dictionary-based protein name recognition is often the first step in extracting information from biomedical documents because it can provide ID information for the recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition, caused mainly by short names; and (2) low recall due to spelling variations. In this paper, we tackle the former problem by using machine learning to filter out false positives, and we present two alternative methods for alleviating the latter problem of spelling variations. The first uses approximate string searching; the second expands the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results on the GENIA corpus revealed that filtering with a naive Bayes classifier greatly improved precision with only a slight loss of recall, resulting in a 10.8% improvement in F-measure. Dictionary expansion with the variant generator gave a further 1.6% improvement and achieved an F-measure of 66.6%.
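The approximate string searching mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain Levenshtein edit distance as the similarity measure, whitespace tokenization, and a toy dictionary whose names and IDs (`TOY_DICTIONARY`, `ID:1`, `ID:2`) are hypothetical. It only shows how a spelling variant such as "NF-kappaB" can still be mapped to a dictionary entry and its ID.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical dictionary: protein name -> made-up identifier.
TOY_DICTIONARY = {"NF-kappa B": "ID:1", "IL-2": "ID:2"}


def find_candidates(text, dictionary, max_dist=1):
    """Return (span, entry_id, distance) for token spans near a dictionary name."""
    tokens = text.split()
    # Longest dictionary name, measured in whitespace-separated tokens.
    max_tokens = max(len(name.split()) for name in dictionary)
    hits = []
    for start in range(len(tokens)):
        for n in range(1, max_tokens + 1):
            if start + n > len(tokens):
                break
            span = " ".join(tokens[start:start + n])
            for name, entry_id in dictionary.items():
                d = edit_distance(span.lower(), name.lower())
                if d <= max_dist:
                    hits.append((span, entry_id, d))
    return hits


if __name__ == "__main__":
    for hit in find_candidates("activation of NF-kappaB by IL2", TOY_DICTIONARY):
        print(hit)
```

With a loose threshold, short names match many unrelated tokens, which is the kind of false recognition the abstract says a classifier is then used to filter out.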

Journal: Journal of Biomedical Informatics
Publication Date: Oct 8, 2004
Citations: 88
License type: elsevier-specific: oa user license

Similar Papers

Use of morphological analysis in protein name recognition
Kaoru Yamamoto ... Yuji Matsumoto
Journal of Biomedical Informatics | VOL. 37
22 Sep 2004

Boosting precision and recall of dictionary-based protein name recognition
Yoshimasa Tsuruoka ... Jun'Ichi Tsujii
01 Jan 2003

False recognition and word length: A reanalysis of Roediger, Watson, McDermott, and Gallo (2001) and some new data
Stephen Madigan ... James Neuse
Psychonomic Bulletin & Review | VOL. 11
01 Jun 2004

Statistical Character-Based Syntax Similarity Measurement for Detecting Biomedical Syntax Variations through Named Entity Recognition
Hossein Tohidi ... Masrah Azrifan Azmi
01 Jan 2010
