Abstract

RDF is a widely accepted framework for describing metadata on the web, thanks to its simplicity and its universal graph-like data model. Given the growing abundance of RDF data, classical query techniques no longer scale. To this end, we leverage the processing power of Apache Spark to load and query large datasets much faster than classical approaches. In this paper, we design experiments to evaluate the performance of several queries, ranging from selecting a single attribute to selecting, filtering, and sorting multiple attributes in the dataset. We further experiment with distributed SPARQL query execution on Apache Spark GraphX and study the different stages involved in this pipeline. Executing distributed SPARQL queries on Apache Spark GraphX allowed us to measure their performance and gave insights into which stages of the pipeline can be improved. The query pipeline comprises graph loading, Basic Graph Pattern matching, and result calculation. Our goal is to minimize the time spent in the graph-loading stage in order to improve overall performance and cut the cost of data loading.

Highlights

  • Semantic web research has come a long way, from labelling web pages and linking information to developing better processing systems and efficiently querying semantic web data for information

  • We further experimented with the performance of distributed SPARQL queries on Apache Spark GraphX and studied the different stages involved in this pipeline

  • The obtained results showed good query response times for Spark-based SPARQL compared with Jena's baseline performance


Summary

INTRODUCTION

Semantic web research has come a long way, from labelling web pages and linking information to developing better processing systems and efficiently querying semantic web data for information. Apache Spark, as a MapReduce framework, offers parallel computation over distributed main-memory data abstractions: 1) Resilient Distributed Datasets (RDDs), a distributed, lineage-based, fault-tolerant abstraction for in-memory computation, and 2) DataFrames (DF), a compressed, schema-enabled abstraction [6]. These abstractions make query programming easier by enabling the translation and processing of high-level query expressions such as SPARQL. The goal is to support different data mining tasks while improving the semantic web and exploring vast datasets for new insights. We achieve this goal by exploiting a cluster's parallelism: our system can load and query a large dataset much more quickly than traditional approaches. Our approach to SPARQL query processing is enhanced with Apache Spark GraphX on different sets of queries over large semantic web datasets. Our goal was to minimize the time spent in the graph-loading stage in order to improve overall performance and cut the cost of data loading.
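To make the pipeline's middle stage concrete, the following minimal sketch shows what Basic Graph Pattern (BGP) matching does conceptually: a BGP is a set of triple patterns containing variables, joined against the loaded RDF triples. This is a plain-Python illustration of the idea only; the data and all names are hypothetical, and it does not reflect the paper's actual Spark GraphX implementation.

```python
# Illustrative sketch of Basic Graph Pattern (BGP) matching, the middle
# stage of the pipeline (graph loading -> BGP matching -> result calculation).
# All data and identifiers here are hypothetical, not from the paper.

VAR = "?"  # SPARQL variables begin with '?'

def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`.

    Returns the extended binding dict, or None on mismatch."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith(VAR):            # variable: bind, or check consistency
            if p in new and new[p] != t:
                return None
            new[p] = t
        elif p != t:                     # constant: must match exactly
            return None
    return new

def bgp(patterns, triples):
    """Join all triple patterns against the graph, SPARQL-BGP style."""
    bindings = [{}]                      # start with the empty binding
    for pattern in patterns:
        bindings = [b2 for b in bindings
                       for t in triples
                       if (b2 := match_pattern(pattern, t, b)) is not None]
    return bindings

# Toy RDF graph as (subject, predicate, object) triples.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", "30"),
]

# SPARQL-like query: SELECT ?x ?y WHERE { ?x knows ?y . ?y knows carol }
results = bgp([("?x", "knows", "?y"), ("?y", "knows", "carol")], triples)
print(results)  # [{'?x': 'alice', '?y': 'bob'}]
```

In a distributed setting the same join happens over partitioned data, which is why graph loading and partitioning dominate the cost the paper sets out to reduce.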

RELATED STUDY
Experimental Setup
Experiment Design
Distributed SPARQL Performance
Distributed SPARQL Analysis
Query Complexity
Findings
CONCLUSION AND FUTURE WORKS