OWL reasoning framework over big biological knowledge network.

Huajun Chen, Xi Chen, Tong Yu, Zhaohui Wu, Peiqin Gu

doi:10.1155/2014/272915

Abstract

In recent years, huge amounts of data have been generated in the domain of biology. Because they embed domain knowledge from different disciplines, otherwise isolated biological resources are implicitly connected, forming a big network of versatile biological knowledge. Faced with such massive, disparate, and interlinked biological data, providing an efficient way to model, integrate, and analyze the big biological network becomes a challenge. In this paper, we present a general OWL (Web Ontology Language) reasoning framework to study the implicit relationships among biological entities. A comprehensive biological ontology across traditional Chinese medicine (TCM) and western medicine (WM) is used to create a conceptual model for the biological network. The corresponding biological data is then integrated into a biological knowledge network as the data model. Based on the conceptual model and data model, a scalable OWL reasoning method is used to infer potential associations between biological entities in the biological network. In our experiment, we focus on association discovery between TCM and WM. The derived associations can help biologists develop novel drugs and support TCM modernization. The experimental results show that the system achieves high efficiency, accuracy, scalability, and effectiveness.

Highlights

With the explosive growth of biological data on the web, large volume data sets are generated rapidly in the field of biology
We present a general OWL reasoning framework for modeling, integration, and analysis of the big biological network
We propose several MapReduce-based property chain reasoning algorithms to discover the implicit associations between entities from the big biological knowledge network
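
To make the idea of MapReduce-based property chain reasoning concrete, the following is a minimal single-machine sketch, not the authors' algorithm. It assumes a property chain of length two, p1 ∘ p2 → q, and simulates the map, shuffle, and reduce phases in plain Python. The property names (hasIngredient, hasTarget, possiblyActsOn) and the entity names are hypothetical placeholders chosen for illustration only.

```python
from collections import defaultdict
from itertools import product


def map_phase(triples, p1, p2):
    """Emit (join_key, tagged_value) pairs for triples using p1 or p2."""
    for s, p, o in triples:
        if p == p1:
            # For s -p1-> o, the join key is the object o.
            yield o, ("left", s)
        if p == p2:
            # For s -p2-> o, the join key is the subject s.
            yield s, ("right", o)


def shuffle(pairs):
    """Group mapped values by key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, q):
    """Join left and right values per key and emit derived q-triples."""
    derived = set()
    for values in groups.values():
        lefts = [v for tag, v in values if tag == "left"]
        rights = [v for tag, v in values if tag == "right"]
        for s, o in product(lefts, rights):
            derived.add((s, q, o))
    return derived


def apply_property_chain(triples, p1, p2, q):
    """Infer (x, q, z) wherever (x, p1, y) and (y, p2, z) both hold."""
    return reduce_phase(shuffle(map_phase(triples, p1, p2)), q)


# Hypothetical toy data; all names are placeholders, not real findings.
triples = [
    ("herb_A", "hasIngredient", "compound_B"),
    ("compound_B", "hasTarget", "target_C"),
    ("compound_B", "hasTarget", "target_D"),
    ("herb_E", "hasIngredient", "compound_F"),
]

inferred = apply_property_chain(
    triples, "hasIngredient", "hasTarget", "possiblyActsOn"
)
```

In this sketch each chain step is a join on the shared middle entity, which is why it maps naturally onto a shuffle by key. Longer chains could be handled by applying the same step repeatedly, feeding each round's derived triples into the next.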

Summary

Introduction

With the explosive growth of biological data on the web, large-volume data sets are generated rapidly in the field of biology. As of February 2014, Linked Life Data (LLD), a data integration platform in the biological domain (http://linkedlifedata.com/sources.html), contains 10,192,505,364 statements and 1,553,620,636 entities. Entrez Gene has more than 100 million gene records (http://www.ncbi.nlm.nih.gov/gene/). The UniProt [1] knowledge base (UniProtKB/Swiss-Prot) contains 53,249,714 sequence entries, comprising about 10 billion amino acids (ftp://ftp.uniprot.org/pub/databases/uniprot/relnotes.txt). Besides the obvious scalability issues, heterogeneity across resources is another major challenge for big biological data integration and analysis. Biological data covers a wide range of entities, including proteins, pathways, diseases, targets, genes, Chinese medical herbs, symptoms, and syndromes, which usually come from multiple isolated sources and have different formats and taxonomies.

Results

Discussion

Conclusion