Abstract

Distributed RDF systems partition data across multiple computer nodes. Partitioning is typically based on heuristics that minimize inter-node communication, and it is performed in an initial data pre-processing phase. The resulting partitions are therefore static and do not adapt to changes in the query workload; as a result, existing systems cannot consistently avoid communication for queries that are not favored by the initial data partitioning. Furthermore, for very large RDF knowledge bases, the partitioning phase becomes prohibitively expensive, leading to high startup costs. In this paper, we propose AdHash, a distributed RDF system that addresses the shortcomings of previous work. First, AdHash initially applies lightweight hash partitioning, which drastically reduces the startup cost while favoring the parallel processing of join patterns on subjects, without any data communication. Using a locality-aware planner, queries that cannot be processed in parallel are evaluated with minimal communication. Second, AdHash monitors the data access patterns and adapts dynamically to the query load by incrementally redistributing and replicating frequently accessed data. As a result, the communication cost for future queries is drastically reduced or even eliminated. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems come online, and (iii) gracefully adapts to the query load, evaluating queries on billion-scale RDF data in under a second. In this demonstration, the audience can use a graphical interface of AdHash to verify its performance superiority over state-of-the-art distributed RDF systems.

Highlights

  • The RDF data model does not require a predefined schema and is a versatile way for representing information from diverse sources

  • A SPARQL query is decomposed into multiple subqueries that are evaluated by each node independently

  • Queries with large intermediate results incur high communication cost, which is detrimental to the query performance [7, 5]


Summary

INTRODUCTION

The RDF data model does not require a predefined schema and is a versatile way for representing information from diverse sources. Distributed RDF systems aim at minimizing the number of decomposed subqueries by partitioning the data carefully. In other words, their goal is to partition the data such that each node has all the data it needs to evaluate the entire query, without exchanging intermediate results. Even sophisticated partitioning and replication cannot guarantee that arbitrarily complex SPARQL queries can be processed in parallel; expensive distributed query evaluation, with intermediate results exchanged between nodes, cannot always be avoided. Adaptivity: WARP [6] and Partout [3] consider the workload during data partitioning. They achieve a significant reduction in the replication ratio, while showing better query performance compared to systems that partition the data blindly. This way, AdHash overcomes the limitations of static partitioning schemes and adapts dynamically to changing workloads.
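
To illustrate why hash partitioning on subjects lets subject-based join patterns run in parallel without communication, here is a minimal sketch in Python. It is not AdHash's implementation: the function names, the use of CRC32 as the hash function, and the sample triples are all assumptions made for illustration. The point it shows is that once every triple is placed by hashing its subject, all triples sharing a subject sit on the same worker, so a star-shaped query joined on the subject can be answered by each worker locally and the results simply concatenated.

```python
import zlib
from collections import defaultdict


def worker_for(subject, num_workers):
    # Hypothetical placement rule: a stable hash of the subject.
    # CRC32 is a stand-in; Python's built-in hash() is salted per process.
    return zlib.crc32(subject.encode("utf-8")) % num_workers


def partition_by_subject(triples, num_workers):
    # Assign each (subject, predicate, object) triple to exactly one worker.
    partitions = [[] for _ in range(num_workers)]
    for s, p, o in triples:
        partitions[worker_for(s, num_workers)].append((s, p, o))
    return partitions


def local_star_join(partition, predicates):
    # Evaluate "?s p1 ?o1 . ?s p2 ?o2 ..." using only one worker's triples.
    by_subject = defaultdict(lambda: defaultdict(list))
    for s, p, o in partition:
        by_subject[s][p].append(o)
    results = []
    for s, objects_by_pred in by_subject.items():
        if all(p in objects_by_pred for p in predicates):
            rows = [(s,)]
            for p in predicates:
                # Extend each partial row with every matching object.
                rows = [row + (o,) for row in rows for o in objects_by_pred[p]]
            results.extend(rows)
    return results


def evaluate_star_query(triples, predicates, num_workers=4):
    # Each partition is processed independently; no intermediate
    # results are exchanged between partitions, only concatenated.
    results = []
    for partition in partition_by_subject(triples, num_workers):
        results.extend(local_star_join(partition, predicates))
    return sorted(results)


if __name__ == "__main__":
    sample = [
        ("alice", "worksFor", "kaust"),
        ("alice", "advisor", "bob"),
        ("bob", "worksFor", "kaust"),
        ("carol", "advisor", "bob"),
        ("carol", "worksFor", "mit"),
    ]
    print(evaluate_star_query(sample, ["worksFor", "advisor"]))
```

A query that joins on something other than the subject (for example, the object of one pattern with the subject of another) would not be covered by this sketch, because the matching triples may live on different workers; that is the case the abstract assigns to the locality-aware planner and to adaptive redistribution.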

Master
Worker
System overview
DEMONSTRATION DETAILS
Demonstration Setup
Demonstration Interface
Preprocessing Time
Findings
Query Performance
