Abstract

Capture and representation of scientific knowledge in a structured format are essential to improve the understanding of biological mechanisms involved in complex diseases. Biological knowledge and knowledge about standardized terminologies are difficult to capture from literature in a usable form. A semi-automated knowledge extraction workflow is presented that was developed to allow users to extract causal and correlative relationships from scientific literature and to transcribe them into the computable and human readable Biological Expression Language (BEL). The workflow combines state-of-the-art linguistic tools for recognition of various entities and extraction of knowledge from literature sources. Unlike most other approaches, the workflow outputs the results to a curation interface for manual curation and converts them into BEL documents that can be compiled to form biological networks. We developed a new semi-automated knowledge extraction workflow that was designed to capture and organize scientific knowledge and reduce the required curation skills and effort for this task. The workflow was used to build a network that represents the cellular and molecular mechanisms implicated in atherosclerotic plaque destabilization in an apolipoprotein-E-deficient (ApoE−/−) mouse model. The network was generated using knowledge extracted from the primary literature. The resultant atherosclerotic plaque destabilization network contains 304 nodes and 743 edges supported by 33 PubMed referenced articles. A comparison between the semi-automated and conventional curation processes showed similar results, but significantly reduced curation effort for the semi-automated process. Creating structured knowledge from unstructured text is an important step for the mechanistic interpretation and reusability of knowledge. Our new semi-automated knowledge extraction workflow reduced the curation skills and effort required to capture and organize scientific knowledge. The atherosclerotic plaque destabilization network that was generated is a causal network model for vascular disease demonstrating the usefulness of the workflow for knowledge extraction and construction of mechanistically meaningful biological networks.
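
The final step described above, compiling BEL statements into a network of nodes and edges, can be illustrated with a minimal Python sketch. It is not the workflow's actual implementation: it assumes each statement is a simple `subject relation object` string, the entity names (`GENE_A`, `GENE_B`, `PROCESS_X`) are hypothetical placeholders rather than findings from the paper, and the helper names `parse_statement` and `build_network` are invented for illustration. Real BEL terms may contain spaces or quoted names, which this simple splitter does not handle.

```python
# Relation keywords taken from the BEL language; only a small subset is listed.
RELATIONS = {"increases", "decreases", "positiveCorrelation", "negativeCorrelation"}

# Hypothetical BEL-style statements. The entities are placeholders,
# not results reported in the paper.
STATEMENTS = [
    "p(HGNC:GENE_A) increases p(HGNC:GENE_B)",
    "p(HGNC:GENE_B) decreases bp(GO:PROCESS_X)",
    "p(HGNC:GENE_A) positiveCorrelation bp(GO:PROCESS_X)",
]


def parse_statement(statement):
    """Split a 'subject relation object' string into a (subject, relation, object) triple."""
    subject, relation, obj = statement.split(" ", 2)
    if relation not in RELATIONS:
        raise ValueError(f"unknown relation: {relation}")
    return subject, relation, obj


def build_network(statements):
    """Collect the unique nodes and the unique typed edges from a list of statements."""
    nodes = set()
    edges = set()
    for statement in statements:
        subject, relation, obj = parse_statement(statement)
        nodes.update((subject, obj))
        edges.add((subject, relation, obj))
    return nodes, edges


nodes, edges = build_network(STATEMENTS)
print(f"{len(nodes)} nodes, {len(edges)} edges")
```

Counting unique nodes and typed edges in this way is one plausible reading of how figures such as "304 nodes and 743 edges" summarize a compiled network.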

Highlights

  • The volume of scientific knowledge has increased rapidly in the past 50 years

  • The efficiency of the semi-automated curation workflow and manual knowledge extraction was evaluated (Table 1) and the results showed that the semi-automated knowledge extraction workflow took less time than the conventional manual extraction (395 min (~6 h) for semi-automated vs. 613 min (~10 h) for manual)

  • Even when the statements were extracted from the same sentences, partly different statements were produced. An example of such differently coded Biological Expression Language (BEL) statements comes from evidence extracted from PMID: 21120482 [43]: ‘capillary vessel counting in and around primary tumors showed that CYP4A11 transfection significantly increased microvessel density per high-powered fields (HPF) (34.1 ± 7.3/HPF in control and 35.32 ± 6.4/HPF in GFP group vs. 63.8 ± 11.4/HPF in A549-CYP4A11 group, P < 0.05)’
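
The timing comparison in the highlights (395 min semi-automated vs. 613 min manual) can be turned into an absolute and relative saving with a short calculation. The sketch below uses only the two figures quoted above; the variable names are illustrative.

```python
# Curation times quoted in the highlights (minutes).
SEMI_AUTOMATED_MIN = 395
MANUAL_MIN = 613

# Absolute saving and saving relative to the manual process.
saved_min = MANUAL_MIN - SEMI_AUTOMATED_MIN
reduction_pct = 100 * saved_min / MANUAL_MIN

print(f"semi-automated: {SEMI_AUTOMATED_MIN / 60:.1f} h")  # 6.6 h
print(f"manual:         {MANUAL_MIN / 60:.1f} h")          # 10.2 h
print(f"saved:          {saved_min} min ({reduction_pct:.1f}%)")  # 218 min (35.6%)
```

On these figures the semi-automated process took roughly a third less time than manual curation.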


Summary

Introduction

The volume of scientific knowledge has increased rapidly in the past 50 years. Medline, the most comprehensive bibliographic database in the life sciences, currently indexes more than 5000 journals and contains abstracts of more than 20 million articles (http://www.nlm.nih.gov/bsd/index_stats_comp.html). Controlled vocabularies or ontologies such as the Gene Ontology (GO) (http://www.geneontology.org/) have been developed to capture the biological data found in literature. These ontologies are used consistently across different model organism databases (MODs) and are amenable to computer manipulation. In this context, text mining tools for managing information recognition and extraction have become increasingly relevant [3]. Despite the striking progress in biocuration and text mining approaches in the context of curated databases, little progress has been made in writing scientific knowledge in a structured and computable form. Scientific knowledge curated at the system level will help researchers rapidly query, visualize and analyse the specific interaction networks implicated in diseases and open new opportunities for the identification of critical biomedical entities as therapeutic targets [10,11,12].

