Abstract

Resource Description Framework (RDF) is the data representation of the Semantic Web, and the volume of RDF data is growing rapidly. Cloud-based systems provide a rich platform for managing RDF data. However, processing RDF queries that contain multiple join operations in a distributed environment raises performance challenges such as network shuffling and memory overhead. To overcome these challenges, this paper proposes a Spark-based RDF query architecture built on the Semantic Connection Set (SCS). First, the proposed architecture re-partitions class data on top of vertical partitioning, which reduces memory overhead and speeds up indexing. Second, a method for generating query plans based on semantic connection sets is proposed. In addition, statistics and broadcast-variable optimization strategies are introduced to reduce shuffling and data-communication costs. The experiments compare the proposed system against SPARQLGX, a recent Spark-based RDF system, using two synthetic benchmarks. The results show that the proposed approach answers queries more efficiently than the compared systems.

Highlights

  • Due to the rapid development of the Semantic Web and knowledge graphs, the amount of data represented in the Resource Description Framework (RDF) [1] has exploded

  • Unlike most existing systems, which use a set of permuted triple indexes, a VP-based storage schema is introduced for managing massive RDF data by further partitioning the rdf:type predicate on top of vertical partitioning (VP) [9]

  • The Semantic Connection Set (SCS), an RDF query processing engine based on Spark, is introduced


Summary

Introduction

Due to the rapid development of the Semantic Web and knowledge graphs, the amount of data represented in the Resource Description Framework (RDF) [1] has exploded. Unlike most existing systems, which use a set of permuted triple (subject, property, object) indexes, a VP-based storage schema is introduced for managing massive RDF data by further partitioning the rdf:type predicate on top of vertical partitioning (VP) [9]. This strategy minimizes the size of the input data, reducing memory overhead and supporting fast indexing. S2RDF [15] uses the Spark SQL [16] interface to perform SPARQL queries. It adopts the VP method for data partitioning, performs semi-joins on the VP tables, and generates multiple tables named ExtVP. This preprocessing step creates significant data-loading overhead, which may be two orders of magnitude larger than in our solution.
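The storage idea above can be sketched in a few lines. This is a minimal plain-Python illustration (not the paper's Spark implementation; the function name and sample triples are invented for the example): triples are grouped into one VP table per predicate, and the rdf:type table is further split into one table per class, so a query over a single class only has to read that class's small table.

```python
from collections import defaultdict

def vertical_partition(triples):
    """Group (subject, predicate, object) triples into VP tables.

    Returns a dict mapping each predicate to a list of (subject, object)
    pairs, except rdf:type, which is re-partitioned by class: the key
    ('rdf:type', class) maps to the list of subjects of that class.
    """
    tables = defaultdict(list)
    for s, p, o in triples:
        if p == "rdf:type":
            # further partition the rdf:type predicate by class (object)
            tables[("rdf:type", o)].append(s)
        else:
            tables[p].append((s, o))
    return dict(tables)

# Hypothetical sample data for illustration.
triples = [
    ("alice",  "rdf:type", "Person"),
    ("bob",    "rdf:type", "Person"),
    ("paper1", "rdf:type", "Article"),
    ("alice",  "author",   "paper1"),
]
tables = vertical_partition(triples)
# A pattern like (?x rdf:type Person) now scans only the
# ('rdf:type', 'Person') table instead of the full rdf:type table.
```

In the paper's setting each such table would be a partitioned dataset on HDFS rather than an in-memory dict, but the partitioning key is the same.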

