Evaluating SPARQL queries on massive RDF datasets

Razen Harbi, Ibrahim Abdelaziz, Panos Kalnis, Nikos Mamoulis

doi:10.14778/2824032.2824083

Abstract

Distributed RDF systems partition data across multiple computer nodes. Partitioning is typically based on heuristics that minimize inter-node communication, and it is performed in an initial data pre-processing phase. The resulting partitions are therefore static and do not adapt to changes in the query workload; as a result, existing systems cannot consistently avoid communication for queries that are not favored by the initial data partitioning. Furthermore, for very large RDF knowledge bases, the partitioning phase becomes prohibitively expensive, leading to high startup costs. In this paper, we propose AdHash, a distributed RDF system that addresses the shortcomings of previous work. First, AdHash initially applies lightweight hash partitioning, which drastically reduces the startup cost while favoring the parallel processing of join patterns on subjects, without any data communication. Using a locality-aware planner, queries that cannot be processed in parallel are evaluated with minimal communication. Second, AdHash monitors the data access patterns and adapts dynamically to the query load by incrementally redistributing and replicating frequently accessed data. As a result, the communication cost for future queries is drastically reduced or even eliminated. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems come online, and (iii) gracefully adapts to the query load, evaluating queries on billion-scale RDF data in sub-second time. In this demonstration, the audience can use a graphical interface of AdHash to verify its performance superiority compared to state-of-the-art distributed RDF systems.
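The idea of hash partitioning on subjects can be illustrated with a minimal sketch. It is not AdHash's implementation: the function names (`node_for`, `partition_by_subject`, `local_star_join`), the use of CRC32 as the hash, and the toy triples are all assumptions made for illustration. The point it shows is that when every triple is placed by the hash of its subject, all triples sharing a subject land on the same node, so a join on the subject can be answered by each node independently.

```python
import zlib
from collections import defaultdict


def node_for(subject: str, num_nodes: int) -> int:
    """Map a subject to a node id with a deterministic hash.

    CRC32 is an illustrative choice, not the hash used by AdHash.
    """
    return zlib.crc32(subject.encode("utf-8")) % num_nodes


def partition_by_subject(triples, num_nodes):
    """Place each (subject, predicate, object) triple on the node of its subject."""
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[node_for(s, num_nodes)].append((s, p, o))
    return partitions


def local_star_join(partition, predicates):
    """Return subjects in one partition that have every requested predicate.

    No data from other partitions is needed, because all triples of a
    subject are stored together.
    """
    seen = defaultdict(set)
    for s, p, _ in partition:
        seen[s].add(p)
    wanted = set(predicates)
    return {s for s, ps in seen.items() if wanted <= ps}


# Hypothetical toy data.
triples = [
    ("alice", "worksFor", "kaust"),
    ("alice", "advisor", "bob"),
    ("bob", "worksFor", "kaust"),
    ("carol", "advisor", "bob"),
]

parts = partition_by_subject(triples, num_nodes=3)

# Each node evaluates the subject-subject join on its own data;
# the final answer is just the union of the local answers.
matches = set()
for part in parts.values():
    matches |= local_star_join(part, ["worksFor", "advisor"])
```

A join that connects the object of one triple to the subject of another (for example, finding the employer of someone's advisor) is not covered by this placement, which is the case the abstract assigns to the locality-aware planner and to adaptive redistribution.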

Highlights

The RDF data model does not require a predefined schema and is a versatile way to represent information from diverse sources.
A SPARQL query is decomposed into multiple subqueries that are evaluated by each node independently.
Queries with large intermediate results incur high communication cost, which is detrimental to query performance [7, 5].

Summary

INTRODUCTION

The RDF data model does not require a predefined schema and is a versatile way to represent information from diverse sources. Distributed RDF systems aim to minimize the number of decomposed subqueries by partitioning the data carefully. In other words, their goal is to partition the data such that each node has all the data it needs to evaluate the entire query, without exchanging intermediate results. Even sophisticated partitioning and replication cannot guarantee that arbitrarily complex SPARQL queries can be processed in parallel; expensive distributed query evaluation, with intermediate results exchanged between nodes, cannot always be avoided. Adaptivity: WARP [6] and Partout [3] consider the workload during data partitioning. They achieve a significant reduction in the replication ratio, while showing better query performance than systems that partition the data blindly. AdHash, in turn, overcomes the limitations of static partitioning schemes and adapts dynamically to changing workloads.
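Workload-driven adaptation can be pictured as frequency monitoring over query patterns. The sketch below is an assumption-laden illustration, not the mechanism used by AdHash, WARP, or Partout: the class name `PatternMonitor`, the fixed `threshold`, and the representation of a pattern as a tuple of predicates are all hypothetical. It only shows the general shape of the idea, in which a pattern seen often enough is flagged once as a candidate for redistribution or replication.

```python
from collections import Counter


class PatternMonitor:
    """Count how often each query pattern is seen and flag frequent ones.

    Hypothetical helper: a real system would also decide which data to
    move and how much replication it can afford.
    """

    def __init__(self, threshold: int):
        self.threshold = threshold  # assumed fixed cut-off for "hot"
        self.counts = Counter()
        self.hot = set()

    def record(self, pattern) -> bool:
        """Record one query; return True only when the pattern first becomes hot."""
        self.counts[pattern] += 1
        if self.counts[pattern] >= self.threshold and pattern not in self.hot:
            self.hot.add(pattern)
            return True
        return False


# Hypothetical workload: one pattern repeats, another appears once.
monitor = PatternMonitor(threshold=3)
workload = [("advisor", "worksFor")] * 3 + [("memberOf",)]

# Patterns that crossed the threshold, in the order they did so.
newly_hot = [p for p in workload if monitor.record(p)]
```

Returning `True` only on the first crossing keeps the trigger idempotent, so a pattern that stays popular does not cause the same data to be redistributed repeatedly.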

Master

Worker

System overview

DEMONSTRATION DETAILS

Demonstration Setup

Demonstration Interface

Preprocessing Time

Findings

Query Performance

Journal: Proceedings of the VLDB Endowment
Publication Date: Aug 1, 2015
Citations: 55
License type: cc-by
