Abstract

Biomedical named entity recognition (BNER) is one of the most essential and initial tasks (discovering relations between biomedical entities, identifying molecular pathways, etc.) of biomedical information retrieval. Although named entity recognition performed well in ordinary text, it still remains challenging in molecular biology domain because of the complex nature of biomedical nomenclature, different kinds of spelling forms and many more reasons. Even though biomedical entities in biological text are found successfully, classifying them into relevant biomedical classes such as genes, proteins, diseases, drug names, etc. is still another challenge and an open question. This paper presents a new method to classify biomedical named entities into protein and non-protein classes. Our approach employs Random Forest, a machine learning algorithm, with a new combination of features. They are orthographic, keyword and morphological, as well as a probabilistic feature called Proteinhood and a Protein-Score feature based on the Medline abstracts cited on the Pubmed, which are the main contributions in the paper. A series of experiments is conducted to compare the proposed approach with other state of the art approaches. Our protein named entity classifier shows significant performance in the experiments on GENIA corpus achieving the highest values of precision 93.8%, recall 83.8% and F-measure 88.5% for protein named entity identification. In this study we showed the effect of new Proteinhood and Protein-Score features as well as adjusting parameters of Random Forest algorithm.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.