Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes.

Atsuko Yamaguchi, Yasunori Yamamoto

doi:10.1371/journal.pone.0217852

Abstract

In life sciences, the rapid growth of sequencing technology and the advancement of research are generating vast amounts of data. As Resource Description Framework (RDF) datasets grow, loading them efficiently into triple stores becomes crucial. For example, UniProt’s RDF version contains 44 billion triples as of December 2018, and PubChem has an RDF dataset with 137 billion triples. Loading datasets of this size into a triple store is time-consuming. To improve the efficiency of this task, parallel loading has been recommended for several stores. However, with parallel loading, dataset consistency must be considered if the dataset contains blank nodes. By definition, blank nodes do not have global identifiers; thus, pairs of identical blank nodes in the original dataset are recognized as different if they reside in separate files after the dataset is split for parallel loading. To address this issue, we propose the Split4Blank tool, which splits a dataset into multiple files under the condition that identical blank nodes are not separated. The proposed tool uses connected component and multiprocessor scheduling algorithms to satisfy this condition. To confirm the effectiveness of the proposed approach, we applied Split4Blank to two life sciences RDF datasets. In addition, we generated synthetic RDF datasets to evaluate scalability based on the properties of various graphs, such as scale-free and random graphs.
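The abstract names only the two building blocks (connected components and multiprocessor scheduling), not their concrete implementation. The following is a minimal sketch of how they could fit together, not the tool's actual code. It assumes triples are given as tuples of N-Triples-style strings in which blank nodes start with `_:`, uses a union-find structure to find connected components, and uses the longest-processing-time-first (LPT) greedy heuristic for scheduling. The names `is_blank` and `split_triples` are hypothetical.

```python
import heapq
from collections import defaultdict


def is_blank(term):
    # Assumption: N-Triples-style terms, where blank node labels start with "_:".
    return term.startswith("_:")


def split_triples(triples, m):
    """Split triples into m groups so that no blank node spans two groups."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 1: two blank nodes in the same triple belong to one component.
    for s, _, o in triples:
        blanks = [t for t in (s, o) if is_blank(t)]
        for b in blanks:
            find(b)  # register the blank node
        if len(blanks) == 2:
            root_s, root_o = find(blanks[0]), find(blanks[1])
            if root_s != root_o:
                parent[root_s] = root_o

    # Step 2: group triples by the component of their blank nodes.
    components = defaultdict(list)
    free = []  # triples without blank nodes can go to any group
    for triple in triples:
        s, _, o = triple
        blanks = [t for t in (s, o) if is_blank(t)]
        if blanks:
            components[find(blanks[0])].append(triple)
        else:
            free.append(triple)

    # Step 3: LPT greedy scheduling. Place the largest jobs first,
    # each one into the currently smallest group.
    jobs = sorted(components.values(), key=len, reverse=True)
    jobs += [[t] for t in free]
    heap = [(0, i) for i in range(m)]
    heapq.heapify(heap)
    groups = [[] for _ in range(m)]
    for job in jobs:
        size, i = heapq.heappop(heap)
        groups[i].extend(job)
        heapq.heappush(heap, (size + len(job), i))
    return groups
```

Calling `split_triples(triples, m)` returns `m` lists of triples, each of which could be written to its own file for parallel loading. LPT is a heuristic, so the largest group is not guaranteed to be as small as possible.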

Highlights

Partly due to the rapid advancement of experimental equipment and data analysis environments, such as high-throughput sequencers, functional magnetic resonance imaging [1], and high performance computing clusters [2, 3], data-driven approaches, i.e., data-intensive science, have become increasingly popular in life sciences.
We start with the formal definition of a Resource Description Framework (RDF) graph: an RDF triple $(s, p, o)$ is an element of $(I \cup B) \times I \times (I \cup B \cup L)$, where $I$, $L$, and $B$ are the set of Internationalized Resource Identifiers (IRIs), the set of literals, and the set of blank nodes, respectively, which are considered pairwise disjoint.
For an RDF graph $G$ with $n$ triples and a positive integer $m$ representing the number of files, we find $m$ disjoint sets $D_1, \ldots, D_m$ of triples in $G$ with minimum $\max_i |D_i|$ such that any blank node $b \in D_i$ does not appear in $D_j$ ($i \neq j$) and any triple $t \in G$ appears in $D_i$ for some $i$.
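The conditions in the problem statement above can be checked mechanically for any candidate split. The sketch below is illustrative only and assumes triples are tuples of N-Triples-style strings whose blank nodes start with `_:`. The function names `blank_nodes` and `check_split` are hypothetical. It verifies that the parts are disjoint, that together they cover every triple of the graph, and that no blank node appears in two parts, then returns the objective value, the size of the largest part.

```python
def blank_nodes(triple):
    # Assumption: blank nodes are strings starting with "_:".
    # The predicate is always an IRI, so only subject and object are checked.
    s, _, o = triple
    return {t for t in (s, o) if t.startswith("_:")}


def check_split(graph, parts):
    """Return the size of the largest part if the split is valid, else raise ValueError."""
    graph = set(graph)
    parts = [set(p) for p in parts]

    # Disjoint and covering: the union equals the graph and no triple is counted twice.
    if set().union(*parts) != graph or sum(len(p) for p in parts) != len(graph):
        raise ValueError("parts must be disjoint and cover every triple")

    # No blank node may appear in two different parts.
    owner = {}
    for i, part in enumerate(parts):
        for triple in part:
            for b in blank_nodes(triple):
                if owner.setdefault(b, i) != i:
                    raise ValueError(f"blank node {b} appears in parts {owner[b]} and {i}")

    return max(len(p) for p in parts)
```

A check like this only confirms validity and reports the objective value; it does not tell whether that value is the minimum.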

Summary

Introduction

Partly due to the rapid advancement of experimental equipment and data analysis environments, such as high-throughput sequencers, functional magnetic resonance imaging [1], and high performance computing clusters [2, 3], data-driven approaches, i.e., data-intensive science, have become increasingly popular in life sciences. Diverse types of data are produced, e.g., genome sequences and images. To understand functions in biological phenomena, we must interpret large amounts of diverse data in an integrated manner.



Journal: PLOS ONE
Publication Date: Jun 4, 2019
License type: CC BY 4.0

