BiodiViz: Leveraging NER and RE for Automated Knowledge Graph Generation in Biodiversity Research

Angela Shannen Tan, Paul Michael Dimayuga, Roselyn Gabud

doi:10.3897/biss.8.140428

Abstract

In biodiversity research, the integration of machine learning and data visualization is increasingly important for uncovering insights from academic literature. This study introduces BiodiViz, a knowledge graph application designed to translate intricate text into intuitive visual representations, fostering a deeper comprehension of biodiversity relationships. BiodiViz uses the best-performing of several fine-tuned Named Entity Recognition (NER) and Relation Extraction (RE) models to automatically generate a knowledge graph for biodiversity research. The NER model extracts and categorizes entities such as organisms, phenomena, and habitats, while the RE model identifies relationships such as "have," "occur in," and "influence," as defined in the BiodivNERE dataset (Abdelmageed et al. 2022). These entities and relationships become the nodes and edges of a graph. Researchers enter text into BiodiViz, which produces a visual knowledge graph that simplifies the analysis of complex biodiversity data and reduces manual effort.

Named Entity Recognition & Relation Extraction

BiodiViz leverages Bidirectional Encoder Representations from Transformers (BERT)-based Large Language Models (LLMs) (Rogers et al. 2020), fine-tuned specifically for NER and RE tasks using the BiodivNERE dataset. The fine-tuning process involved various models, including BERT (Devlin et al. 2019), ELECTRA (Clark et al. 2020), and BiodivBERT (Abdelmageed et al. 2023). The models were evaluated with the F1-score as the main metric. The F1-score is the harmonic mean of precision (the proportion of true positive results among all positive predictions) and recall (the proportion of true positive results among all actual positives). BiodivBERT achieved the best F1-score on the NER task, 77.16%, while BERT performed best on the RE task with an F1-score of 81.28%.
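To make the metric concrete, here is a minimal Python sketch that computes precision, recall, and F1 from raw counts. The function name and the example counts are illustrative assumptions. They are not taken from the study and do not reproduce its reported scores.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only (assumed, not results from the study):
# 8 correct predictions, 2 spurious predictions, 4 missed items.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1 by maximizing precision or recall alone.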
Rigorous hyperparameter optimization further enhanced the performance of BiodivBERT in the RE task by 3.38%.

The BiodivNERE corpora by Abdelmageed et al. (2022) were used to fine-tune several models for NER and RE tasks in the biodiversity domain. The first corpus, BiodivNER, is a gold standard dataset (manually labelled test corpora) for evaluating NER tasks. The fine-tuning process employed the token classification method from the Hugging Face library (Hugging Face 2023b), which assigns a label to each token in a sequence. Experiments were conducted with a batch size of four, meaning the model processes four examples (rows of data) at a time before making an update to improve its learning. This batch size was chosen because of the constraints of the NVIDIA® GeForce RTX™ 3060 graphics processor (NVIDIA 2024). Model performance was evaluated using the seqeval library (Nakayama 2018), focusing on accuracy, precision, recall, and F1 scores.

For text classification, the second corpus, BiodivRE, was used, following previous research recommendations to explore fine-tuning settings for BiodivBERT. Hyperparameter optimization (Feurer and Hutter 2019) was conducted using Hugging Face’s Trainer API with an Optuna backend (Hugging Face 2023a), concentrating on the learning rate and the number of training epochs (i.e., the number of complete passes through the entire dataset during model training).

The BiodiViz Knowledge Graph Application

The fine-tuned NER and RE models with the best F1-scores (BiodivBERT and BERT, respectively) were integrated into the knowledge graph application. Fig. 1 illustrates the flowchart of the application pipeline. Each sentence in the input text will go through the NER model to identify and label the entities within the sentence. Subsequently, these labeled entities, together with the original sentence, will be input into the RE model.
The RE model will analyze every pair of entities for a potential relation and output the type of relation they share. The application will then use this output to create a graph with appropriate labels and color-coding. An example of the application's user interface with the knowledge graph is shown in Fig. 2.

This study highlights the practical application of machine learning and data visualization in advancing biodiversity research, emphasizing the importance of developing user-friendly tools to support scientific exploration and discovery. The BiodiViz application, including the code and resources, is available on GitHub, providing an accessible tool for biodiversity researchers to streamline their analyses.
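The pipeline described above (NER on each sentence, RE on every entity pair, then graph construction) can be sketched roughly as follows. This is not the BiodiViz code. The functions `toy_ner` and `toy_re` and their lookup tables are hypothetical stand-ins for the fine-tuned BiodivBERT and BERT models, the naive sentence splitting is an assumption, and only one direction per entity pair is checked for brevity.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup tables standing in for the fine-tuned models.
TOY_ENTITIES = {"beetles": "organism", "forests": "habitat"}
TOY_RELATIONS = {("beetles", "forests"): "occur in"}


def toy_ner(sentence: str) -> List[Tuple[str, str]]:
    """Stand-in for the NER model: return (entity text, label) pairs."""
    tokens = [t.strip(".,;").lower() for t in sentence.split()]
    return [(t, TOY_ENTITIES[t]) for t in tokens if t in TOY_ENTITIES]


def toy_re(sentence: str, head: str, tail: str) -> Optional[str]:
    """Stand-in for the RE model: return a relation type, or None."""
    return TOY_RELATIONS.get((head, tail))


def build_graph(text: str) -> Tuple[Dict[str, str], List[Tuple[str, str, str]]]:
    """Return (nodes, edges): nodes map entity -> label,
    edges are (head, relation, tail) triples."""
    nodes: Dict[str, str] = {}
    edges: List[Tuple[str, str, str]] = []
    # Naive sentence splitting on full stops (an assumption for this sketch).
    for sentence in filter(None, (s.strip() for s in text.split("."))):
        entities = toy_ner(sentence)
        nodes.update(entities)
        # Check each pair of entities found in the same sentence.
        for (head, _), (tail, _) in combinations(entities, 2):
            relation = toy_re(sentence, head, tail)
            if relation is not None:
                edges.append((head, relation, tail))
    return nodes, edges


nodes, edges = build_graph("Beetles live in forests. Forests are shrinking.")
print(nodes)
print(edges)
```

The node labels returned here are what a front end could map to colors, and the relation types are what it could render as edge labels.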

Journal: Biodiversity Information Science and Standards
Publication Date: Oct 29, 2024
License type: CC BY 4.0
