Abstract

This paper proposes an ensemble of classifiers for biomedical name recognition in which three classifiers, one Support Vector Machine and two discriminative Hidden Markov Models, are combined effectively using a simple majority voting strategy. In addition, we incorporate three post-processing modules into the system to further improve performance: an abbreviation resolution module, a protein/gene name refinement module and a simple dictionary matching module. Evaluation shows that our system achieves the best performance among 10 systems, with a balanced F-measure of 82.58 on the closed evaluation of the BioCreative protein/gene name recognition task (Task 1A).
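For readers unfamiliar with the metric, the balanced F-measure reported above is the harmonic mean of precision and recall. The sketch below computes it from entity-level counts; the function name and the example counts are hypothetical and are not taken from the paper's evaluation.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, balanced F-measure) from raw counts.

    tp: predicted names that match a gold-standard name
    fp: predicted names with no gold-standard match
    fn: gold-standard names the system missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=80, fp=20, fn=20)
print(f"precision={p:.2%} recall={r:.2%} F={f:.2%}")
```

Because the harmonic mean is dominated by the smaller of the two values, a system with high precision but low recall (or the reverse) scores lower than one that balances the two.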

Highlights

  • With an overwhelming amount of textual information in biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists to gather and make use of the knowledge encoded in text documents

  • The only difference between the two discriminative Hidden Markov Models (DHMMs) comes from the part-of-speech (POS) features, which are trained on different corpora

  • Our evaluation on the dry-run data shows that the three classifiers behave differently: the Support Vector Machine (SVM), using the POS feature trained on the refined BioCreative-POS corpus (see below for details), has high precision and low recall; DHMM1, using the POS feature trained on the refined BioCreative-POS corpus, has balanced precision and recall; and DHMM2, using the POS feature trained on the unrefined BioCreative-POS corpus, has low precision and high recall
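A simple majority vote over three classifiers with these complementary precision/recall profiles can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-token B/I/O labels, the function name `majority_vote`, and the tie-breaking rule (fall back to one designated classifier when no label has a strict majority) are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: int = 0) -> List[str]:
    """Combine per-token label sequences from several classifiers.

    predictions: one label sequence per classifier, all over the same tokens.
    fallback: index of the classifier to trust when no label has a strict
              majority (an assumed tie-break, not specified in the source).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all classifiers must label the same tokens")

    combined = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        # Accept the most frequent label only if more than half the voters agree.
        combined.append(label if count > len(labels) / 2 else labels[fallback])
    return combined


# Hypothetical outputs for a four-token sentence.
svm = ["B", "I", "O", "O"]     # high precision, low recall
dhmm1 = ["B", "I", "I", "O"]   # balanced
dhmm2 = ["B", "O", "I", "B"]   # low precision, high recall
print(majority_vote([svm, dhmm1, dhmm2]))
```

With three voters, a label is accepted whenever at least two classifiers agree, so a spurious name proposed only by the high-recall model is voted down, while a name missed only by the high-precision model is still recovered.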

Summary

Introduction

With an overwhelming amount of textual information in biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists to gather and make use of the knowledge encoded in text documents. MEDLINE [1], the primary research database serving the biomedical community, is an online bibliographic source of citations and abstracts dating from 1966 to the present; it currently contains over 12 million abstracts, with 60,000 new abstracts added each month. Each of them contains entries ranging from thousands to millions and multiplies rapidly. All of these resources are annotated manually by human experts. Such manual handling is severely throughput-limited, extremely time-consuming and enormously expensive. To make organized and structured information available, automatic recognition of biomedical names becomes critical; it is important for protein-protein interaction extraction, pathway construction, automatic database curation, etc.

