NLIMED: Natural Language Interface for Model Entity Discovery in Biosimulation Model Repositories.

Yuda Munarko, David P Nickerson, Dewan M Sarwar, John H Gennari, Anand Rampadarath, Koray Atalag, Maxwell L Neal

doi:10.3389/fphys.2022.820683

Abstract

Semantic annotation is a crucial step to assure reusability and reproducibility of biosimulation models in biology and physiology. For this purpose, the COmputational Modeling in BIology NEtwork (COMBINE) community recommends the use of the Resource Description Framework (RDF). This grounding in RDF provides the flexibility to enable searching for entities within models (e.g., variables, equations, or entire models) by utilizing the RDF query language SPARQL. However, the rigidity and complexity of the SPARQL syntax and the tree-like structure of semantic annotations are challenging for users. Therefore, we propose NLIMED, an interface that converts natural language queries into SPARQL. We use this interface to query and discover model entities from repositories of biosimulation models. NLIMED works with the Physiome Model Repository (PMR) and the BioModels database, and potentially with other repositories annotated using RDF. Natural language queries are first “chunked” into phrases and annotated against ontology classes and predicates utilizing different natural language processing tools. Then, the ontology classes and predicates are composed as SPARQL and finally ranked using our SPARQL Composer and our indexing system. We demonstrate that NLIMED's approach for chunking and annotating queries is more effective than the NCBO Annotator for identifying relevant ontology classes in natural language queries. Comparison of NLIMED's behavior against historical query records in the PMR shows that it can adapt appropriately to queries associated with well-annotated models.

Highlights

The Resource Description Framework (RDF) is a standard data model from the semantic web community that is used in semantically annotated biosimulation models, such as those formatted in CellML (Cuellar et al, 2003) and Systems Biology Markup Language (SBML) (Hucka et al, 2003) in the Physiome Model Repository (PMR) (Yu et al, 2011) and BioModels Database (Chelliah et al, 2015).
We show that the Natural Language Interface for Model Entity Discovery (NLIMED) approach for detecting ontology classes in a Natural Language Query (NLQ) is more effective than the NCBO Annotator based on precision, recall, and F-measure statistics, with margins above 0.13.
We demonstrate NLIMED, an interface for translating NLQ into SPARQL that consists of NLQ Annotator and SPARQL Generator modules, for model entity discovery.
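
The two-stage idea named above (annotate an NLQ with ontology classes, then generate SPARQL) can be pictured with a minimal toy sketch. This is not NLIMED's implementation: the dictionary lookup stands in for the natural language processing tools, no ranking is performed, and every name here (`TOY_ONTOLOGY`, `annotate`, `compose_sparql`, the `example.org` IRIs, and the `annotatedWith` predicate) is a hypothetical placeholder chosen for illustration.

```python
# Toy phrase-to-class dictionary; every IRI below is a made-up placeholder.
TOY_ONTOLOGY = {
    "concentration": "http://example.org/onto/Concentration",
    "glucose": "http://example.org/onto/Glucose",
    "flux": "http://example.org/onto/Flux",
    "sodium ion": "http://example.org/onto/SodiumIon",
}

# Longest phrase (in words) that the dictionary can match.
MAX_PHRASE_WORDS = max(len(phrase.split()) for phrase in TOY_ONTOLOGY)


def annotate(query):
    """Greedy longest-match chunking: return (phrase, IRI) pairs in query order."""
    words = query.lower().split()
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at word i first.
        for n in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in TOY_ONTOLOGY:
                found.append((phrase, TOY_ONTOLOGY[phrase]))
                i += n
                break
        else:
            i += 1  # no dictionary phrase starts here; skip this word
    return found


def compose_sparql(annotations, predicate="http://example.org/onto/annotatedWith"):
    """Build one SPARQL query requiring the entity to carry every detected class."""
    lines = ["SELECT DISTINCT ?entity WHERE {"]
    for _, iri in annotations:
        lines.append(f"  ?entity <{predicate}> <{iri}> .")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(compose_sparql(annotate("flux of sodium ion")))
```

A real annotator would also need to handle synonyms, ambiguity, and candidate ranking, all of which this sketch deliberately leaves out.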

Summary

Introduction

The Resource Description Framework (RDF) is a standard data model from the semantic web community that is used in semantically annotated biosimulation models, such as those formatted in CellML (Cuellar et al, 2003) and Systems Biology Markup Language (SBML) (Hucka et al, 2003) in the Physiome Model Repository (PMR) (Yu et al, 2011) and BioModels Database (Chelliah et al, 2015). Composite annotations are logical statements linking multiple knowledge resource terms, enabling modelers to precisely define model elements in a structured manner. Methods such as those presented here are able to make use of that structure to go beyond the raw RDF triples with which a model may be annotated.
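
To make the notion of a composite annotation concrete, the sketch below represents one as a small chain of subject-predicate-object triples and walks that chain to collect every ontology class attached to a model variable. It is an illustration only, under stated assumptions: the `ex:` predicates and classes, the blank-node labels, the variable name `model:var_glc`, and the function `reachable_classes` are all hypothetical and are not taken from any repository's actual annotation vocabulary.

```python
from collections import defaultdict

# Hypothetical composite annotation meaning "concentration of glucose in blood".
# Each tuple is (subject, predicate, object); all identifiers are placeholders.
TRIPLES = [
    ("model:var_glc", "ex:hasProperty", "_:prop"),
    ("_:prop", "ex:isA", "ex:Concentration"),
    ("_:prop", "ex:propertyOf", "_:entity"),
    ("_:entity", "ex:isA", "ex:Glucose"),
    ("_:entity", "ex:partOf", "_:location"),
    ("_:location", "ex:isA", "ex:Blood"),
]


def reachable_classes(triples, start):
    """Follow the annotation chain from `start` and collect every ex:isA target."""
    outgoing = defaultdict(list)
    for subject, predicate, obj in triples:
        outgoing[subject].append((predicate, obj))

    classes = []
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for predicate, obj in outgoing[node]:
            if predicate == "ex:isA":
                classes.append(obj)   # an ontology class attached to this node
            elif obj not in seen:
                seen.add(obj)         # an intermediate node; keep walking
                stack.append(obj)
    return classes


if __name__ == "__main__":
    print(reachable_classes(TRIPLES, "model:var_glc"))
```

Looking only at the single triple whose subject is the variable would reveal none of the three classes; they become visible only by following the structure of the annotation.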

Methods

Results

Discussion

Conclusion