Abstract

RDF is a widely accepted framework for describing metadata on the web, thanks to its simplicity and its universal graph-like data model. Given the growing abundance of RDF data, classical query techniques no longer scale. To this end, we leverage the processing power of Apache Spark to load and query large datasets much faster than classical approaches. In this paper, we design experiments to evaluate the performance of several queries, ranging from selecting a single attribute to selecting, filtering, and sorting multiple attributes in the dataset. We further experiment with distributed SPARQL query execution on Apache Spark GraphX and study the different stages involved in this pipeline. Executing distributed SPARQL queries on Apache Spark GraphX allowed us to measure their performance and gave insights into which stages of the pipeline can be improved. The query pipeline comprises graph loading, Basic Graph Pattern matching, and result calculation. Our goal is to minimize the time spent in the graph-loading stage in order to improve overall performance and cut the cost of data loading.

Highlights

  • Semantic web research has come a long way, from labelling web pages and linking information to developing better processing systems and efficiently querying semantic web data for information

  • We further experimented with the performance of distributed SPARQL queries on Apache Spark GraphX and studied the different stages involved in this pipeline

  • The obtained results showed good query response times for Spark-based SPARQL compared with Jena's baseline performance


Summary

INTRODUCTION

Semantic web research has come a long way, from labelling web pages and linking information to developing better processing systems and efficiently querying semantic web data for information. Apache Spark, as a MapReduce framework, offers parallel computation over distributed main-memory data abstractions: 1) Resilient Distributed Datasets (RDDs), a distributed, lineage-based, fault-tolerant abstraction for in-memory computation, and 2) DataFrames (DF), a compressed, schema-enabled abstraction [6]. These abstractions make query programming easier by enabling the translation and processing of high-level query expressions such as SPARQL. The goal is to support different data mining tasks while improving the semantic web and exploring vast datasets for new insights. We achieve this goal by exploiting a cluster's parallelism: our system can load and query a large dataset much more quickly than traditional approaches. Our approach to SPARQL query processing is enhanced with Apache Spark GraphX on different sets of queries over large semantic web datasets. Our goal was to minimize the time spent in the graph-loading stage in order to improve overall performance and cut the cost of data loading.
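To make the pipeline's middle stage concrete, the following minimal sketch shows what Basic Graph Pattern (BGP) matching does conceptually: a BGP is a set of triple patterns containing variables, joined against the loaded RDF triples. This is a plain-Python illustration of the idea only; the data and all names are hypothetical, and it does not reflect the paper's actual Spark GraphX implementation.

```python
# Illustrative sketch of Basic Graph Pattern (BGP) matching, the middle
# stage of the pipeline (graph loading -> BGP matching -> result calculation).
# All data and identifiers here are hypothetical, not from the paper.

VAR = "?"  # SPARQL variables begin with '?'

def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`.

    Returns the extended binding dict, or None on mismatch."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith(VAR):            # variable: bind, or check consistency
            if p in new and new[p] != t:
                return None
            new[p] = t
        elif p != t:                     # constant: must match exactly
            return None
    return new

def bgp(patterns, triples):
    """Join all triple patterns against the graph, SPARQL-BGP style."""
    bindings = [{}]                      # start with the empty binding
    for pattern in patterns:
        bindings = [b2 for b in bindings
                       for t in triples
                       if (b2 := match_pattern(pattern, t, b)) is not None]
    return bindings

# Toy RDF graph as (subject, predicate, object) triples.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", "30"),
]

# SPARQL-like query: SELECT ?x ?y WHERE { ?x knows ?y . ?y knows carol }
results = bgp([("?x", "knows", "?y"), ("?y", "knows", "carol")], triples)
print(results)  # [{'?x': 'alice', '?y': 'bob'}]
```

In a distributed setting the same join happens over partitioned data, which is why graph loading and partitioning dominate the cost the paper sets out to reduce.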

RELATED STUDY
Experimental Setup
Experiment Design
Distributed SPARQL Performance
Distributed SPARQL Analysis
Query Complexity
Findings
CONCLUSION AND FUTURE WORKS