Abstract

Background: Large quantities of biomedical data are being produced at a rapid pace for a variety of organisms. With ontologies proliferating, data is increasingly being stored using the RDF data model and queried using RDF-based querying languages. While existing systems facilitate the querying in various ways, the scientist must map the question in his or her mind to the interface used by the systems. The field of natural language processing has long investigated the challenges of designing natural-language-based retrieval systems. Recent efforts seek to bring the ability to pose natural language questions to RDF data querying systems while leveraging the associated ontologies. These systems analyze the input question and extract triples (subject, relationship, object), if possible, mapping them to RDF triples in the data. However, in the biomedical context, relationships between entities are not always explicit in the question, and they are often complex, involving many intermediate concepts.

Results: We present a new framework, OntoNLQA, for querying RDF data annotated using ontologies, which allows posing questions in natural language. OntoNLQA offers five steps in order to answer natural language questions. In comparison to previous systems, OntoNLQA differs in how some of the methods are realized. In particular, it introduces a novel approach for discovering the sophisticated semantic associations that may exist between the key terms of a natural language question, in order to build an intuitive query and retrieve precise answers. We apply this framework to the context of parasite immunology data, leading to a system called AskCuebee that allows parasitologists to pose genomic, proteomic and pathway questions in natural language related to the parasite, Trypanosoma cruzi. We separately evaluate the accuracy of each component of OntoNLQA as implemented in AskCuebee and the accuracy of the whole system. AskCuebee answers 68 % of the questions in a corpus of 125 questions, and 60 % of the questions in a new previously unseen corpus. If we allow simple corrections by the scientists, this proportion increases to 92 %.

Conclusions: We introduce a novel framework for question answering and apply it to parasite immunology data. Evaluations of translating the questions to RDF triple queries by combining machine learning, lexical similarity matching with ontology classes, properties and instances for specificity, and discovering associations between them demonstrate that the approach performs well and improves on previous systems. Consequently, OntoNLQA offers a viable framework for building question answering systems in other biomedical domains.

Electronic supplementary material: The online version of this article (doi:10.1186/s13326-015-0029-x) contains supplementary material, which is available to authorized users.

Highlights

  • Large quantities of biomedical data are being produced at a rapid pace for a variety of organisms

  • We present a new approach for answering natural language questions on structured data that combines machine learning with semantic computing: use of existing ontologies, their structure and annotated data, and triple-based queries

  • The resulting system called AskCuebee allows parasitologists to pose genomic, proteomic and pathway questions in natural language related to the parasite, T. cruzi, for the first time


Summary

Introduction

Large quantities of biomedical data are being produced at a rapid pace for a variety of organisms. The RDF data model has the advantage of making the relationships between the data items explicit, and provides a straightforward way of annotating data using ontologies. An example of this is the semantic problem solving environment for the immunology of the parasite, Trypanosoma cruzi (T. cruzi), which utilizes an RDF triple store for hosting the parasite’s genomic (microarray), proteomic (transcriptome) and pathway data [4]. The data is annotated using the parasite experiment ontology (PEO) and queried using the open-source Cuebee [5], which provides an interface for facilitating the parasitologist’s formulation of SPARQL queries. Another example is the translational medicine ontology and knowledge base [6], which utilizes the unifying ontology to annotate integrated genomic, proteomic and disease data, along with patient electronic records. The data may be browsed in an RDF triple store
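To make the idea of explicit relationships concrete, the following minimal Python sketch stores a few (subject, relationship, object) triples and answers pattern queries over them. It also searches for a chain of triples linking two terms, which illustrates why a question relating two entities may need intermediate concepts. The identifiers (`gene_A`, `encodes`, and so on) and the helper names `match` and `find_path` are hypothetical and are not taken from PEO, Cuebee, or OntoNLQA. The breadth-first search is a generic stand-in, not the association-discovery method of the paper.

```python
from collections import deque

# Hypothetical toy data: illustrative identifiers only, NOT real PEO terms.
TRIPLES = {
    ("gene_A", "encodes", "protein_B"),
    ("protein_B", "participates_in", "pathway_C"),
    ("gene_A", "studied_in", "microarray_D"),
}


def match(triples, s=None, p=None, o=None):
    """Return the sorted triples matching a pattern; None is a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def find_path(triples, start, goal):
    """Breadth-first search for a shortest chain of triples from start to goal.

    Edges are followed from subject to object only. Returns a list of
    triples, an empty list if start == goal, or None if no chain exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for t in match(triples, s=node):
            if t[2] not in seen:
                seen.add(t[2])
                queue.append((t[2], path + [t]))
    return None


if __name__ == "__main__":
    # "What does gene_A encode?" as a triple pattern with an open object.
    print(match(TRIPLES, s="gene_A", p="encodes"))
    # "How is gene_A related to pathway_C?" needs an intermediate concept.
    print(find_path(TRIPLES, "gene_A", "pathway_C"))
```

In a real triple store the first query would be written as a SPARQL triple pattern with a variable in the object position. The tuple-based version here only mirrors that shape without depending on any RDF library.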

