Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition.

Meiqi Wang, Avish Vijayaraghavan, Tim Beck, Joram M Posma

doi:10.1021/acs.jproteome.3c00367

Abstract

Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
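The two ideas in the abstract that lend themselves to a worked example are the annotation step (dictionary matching plus rule-based keyword searching) and the reported scores. The following is a minimal sketch only, not the authors' implementation, which is available from the repositories linked above. The dictionary entries, the "-ase" suffix rule, the stoplist, and the function names `annotate` and `f1_score` are all assumptions made for illustration. The last function checks that the reported F1-score of 0.86 is consistent with the harmonic mean of the rounded precision (1.00) and recall (0.76).

```python
import re

# Hypothetical mini-dictionary; the abstract does not list the pipeline's sources.
ENZYME_DICTIONARY = ["lactate dehydrogenase", "trypsin", "pepsin"]

# Assumed keyword rule: words ending in "-ase"/"-ases" are candidate enzyme names.
SUFFIX_RULE = re.compile(r"\b[A-Za-z][A-Za-z-]*ases?\b")

# Assumed stoplist of common non-enzyme words that also end in "-ase".
NON_ENZYME_WORDS = {"base", "case", "phase", "disease", "increase",
                    "decrease", "database", "release"}


def annotate(text):
    """Return sorted (start, end, mention) spans found by either method."""
    spans = set()

    # 1) Dictionary matching: case-insensitive, whole-word lookups.
    for term in ENZYME_DICTIONARY:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.add((match.start(), match.end()))

    # 2) Rule-based keyword search: suffix rule minus the stoplist.
    for match in SUFFIX_RULE.finditer(text):
        word = match.group().lower()
        singular = word[:-1] if word.endswith("s") else word
        if word not in NON_ENZYME_WORDS and singular not in NON_ENZYME_WORDS:
            spans.add((match.start(), match.end()))

    # Keep the longest mention when one span lies inside another,
    # e.g. "dehydrogenase" inside "lactate dehydrogenase".
    kept = [
        s for s in spans
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
    ]
    return sorted((start, end, text[start:end]) for start, end in kept)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = ("Trypsin and lactate dehydrogenase were assayed; "
                "a novel kinase was linked to the disease.")
    for start, end, mention in annotate(sentence):
        print(start, end, mention)
    print(round(f1_score(1.00, 0.76), 2))
```

In this sketch the stoplist trades a little recall for precision, which mirrors the high-precision, lower-recall profile the abstract reports for the rule-based pipeline. Mentions that are in neither the dictionary nor covered by the suffix rule are simply missed.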


Journal: Journal of Proteome Research
Publication Date: May 11, 2024
License type: CC BY 4.0

