A distributed query execution engine of big attributed graphs.

Omar Batarfi, Sherif Sakr, Ayman Fayoumi, Ahmed Barnawi, Radwa Elshawi

doi:10.1186/s40064-016-2251-0

Abstract

A graph is a popular data model that has become pervasively used for modeling structural relationships between objects. In many real-world graphs, the vertices and edges need to be associated with descriptive attributes. Such graphs are referred to as attributed graphs. G-SPARQL has been proposed as an expressive language, with a centralized execution engine, for querying attributed graphs. G-SPARQL supports various types of graph querying operations, including reachability, pattern matching and shortest path, where any G-SPARQL query may include value-based predicates on the descriptive information (attributes) of the graph edges and vertices in addition to structural predicates. In general, a main limitation of centralized systems is that their vertical scalability is always restricted by the physical limits of computer systems. This article describes the design, implementation and performance evaluation of DG-SPARQL, a distributed, hybrid and adaptive parallel execution engine for G-SPARQL queries. In this engine, the topology of the graph is distributed over the main memory of the underlying nodes, while the graph data are maintained in a relational store that is replicated on the disk of each of the underlying nodes. DG-SPARQL evaluates parts of the query plan via SQL queries that are pushed to the underlying relational stores, while other parts of the query plan, as necessary, are evaluated via indexless memory-based graph traversal algorithms. Our experimental evaluation shows the efficiency and scalability of DG-SPARQL in querying massive attributed graph datasets, as well as its ability to outperform Apache Giraph, a popular distributed graph processing system, by orders of magnitude.

Highlights

In this era, we are witnessing a continuous expansion and integration of computation, networking, digital devices and data storage systems in a way that has provided a rich platform for the explosion in big data as well as the means by which big data are produced, stored, processed and analyzed
In DG-SPARQL, the topology of the graph is distributed over the main memory of the underlying nodes while the graph data are maintained in a relational store which is replicated on the disk of each of the underlying nodes (Hammoud et al 2015)
The query optimizer starts by compiling the user input query (Q) into a logical query plan QP using a defined set of G-SPARQL algebraic operators (Sakr et al 2012)

Summary

Introduction

We are witnessing a continuous expansion and integration of computation, networking, digital devices and data storage systems in a way that has provided a rich platform for the explosion in big data as well as the means by which big data are produced, stored, processed and analyzed.

Batarfi et al SpringerPlus (2016) 5:665

... users. It has become crucial for many applications to be able to efficiently store, query and analyze these big graphs (Sakr and Pardede 2011). An attributed graph (Ehrig et al 2004) is a variant of the graph data model in which each node is identified by a unique identifier and labeled with a string. Each edge in the attributed graph is likewise identified by a unique identifier and labeled with a string. Each node or edge can be associated with a collection of key/value pairs that represent its descriptive information or properties. Given a large attributed graph that includes billions of edges and nodes (e.g., a bibliographic network or a social network) with their descriptive information, one of the fundamental challenges is how to efficiently query and analyze such big graphs. Each edge $e \in E$ can be associated with a vector of key/value pairs $[b_1(e_1), \ldots, b_n(e_n)]$, where $b_k(e_k)$ represents the attribute value of edge $e$ on attribute $b_k$.
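The attributed graph model described above can be illustrated with a minimal Python sketch. It is not the DG-SPARQL implementation: the class names (`Node`, `Edge`, `AttributedGraph`), the `reachable` helper and the sample data are all hypothetical, chosen only to show how identifiers, labels and key/value attributes fit together, and how a value-based predicate can be combined with a structural (reachability) predicate.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int          # unique identifier
    label: str            # string label
    attrs: dict = field(default_factory=dict)  # key/value attributes


@dataclass
class Edge:
    edge_id: int          # unique identifier
    label: str            # string label
    source: int
    target: int
    attrs: dict = field(default_factory=dict)  # key/value attributes


class AttributedGraph:
    """Hypothetical in-memory attributed graph (illustration only)."""

    def __init__(self):
        self.nodes = {}      # node id -> Node
        self.edges = {}      # edge id -> Edge
        self.out_edges = {}  # node id -> list of outgoing edge ids

    def add_node(self, node_id, label, **attrs):
        self.nodes[node_id] = Node(node_id, label, attrs)
        self.out_edges.setdefault(node_id, [])

    def add_edge(self, edge_id, label, source, target, **attrs):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[edge_id] = Edge(edge_id, label, source, target, attrs)
        self.out_edges[source].append(edge_id)

    def reachable(self, start, goal, edge_pred=lambda e: True):
        """Breadth-first search that follows only edges satisfying edge_pred."""
        seen, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            if current == goal:
                return True
            for eid in self.out_edges[current]:
                edge = self.edges[eid]
                if edge_pred(edge) and edge.target not in seen:
                    seen.add(edge.target)
                    queue.append(edge.target)
        return False


# Made-up sample data in the spirit of a bibliographic network.
g = AttributedGraph()
g.add_node(1, "author", name="Alice", age=45)
g.add_node(2, "author", name="Bob", age=31)
g.add_node(3, "paper", title="Graph Querying")
g.add_edge(10, "knows", 1, 2, since=2009)
g.add_edge(11, "authorOf", 2, 3, order=1)

# Value-based predicate on node attributes.
older = [n.node_id for n in g.nodes.values() if n.attrs.get("age", 0) > 40]
```

In this sketch, `g.reachable(1, 3)` follows every edge, whereas `g.reachable(1, 3, lambda e: e.label == "knows")` restricts the traversal to edges that satisfy a predicate on the edge itself, which is the kind of mixed structural and value-based condition the text attributes to G-SPARQL queries.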

Methods

Results

Conclusion