Abstract

Resource Description Framework (RDF) is a commonly used data model in the Semantic Web environment. Libraries and various other communities use the RDF data model to store valuable data after extracting it from traditional storage systems. However, because of the sheer volume of this data, processing and storing it overwhelms traditional data-management tools. This challenge demands a scalable, distributed system that can process data in parallel. In this article, a distributed solution is proposed for efficiently processing and storing the large volume of library linked data held in traditional storage systems. Apache Spark is used for parallel processing of large data sets, and a column-oriented schema is proposed for storing RDF data. The storage system is built on top of the Hadoop Distributed File System (HDFS) and uses the Apache Parquet format to store data in compressed form. The experimental evaluation showed that storage requirements were reduced significantly compared to the Jena TDB, Sesame, RDF/XML, and N-Triples formats. SPARQL queries are processed with Spark SQL to query the compressed data. The experimental evaluation showed a good query response time, which decreases significantly as the number of worker nodes increases.
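
The storage-and-query flow described above can be sketched in a few lines of PySpark. This is a minimal illustration, not the paper's implementation: the HDFS paths, the three-column (subject, predicate, object) layout, and the dc:creator query pattern are all assumptions made for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdf-parquet-demo").getOrCreate()

# Read pre-parsed triples (here assumed tab-separated: subject, predicate, object).
triples = (spark.read
           .option("sep", "\t")
           .csv("hdfs:///rdf/triples.tsv")       # hypothetical input path
           .toDF("subject", "predicate", "object"))

# Store the triples in compressed, column-oriented Parquet on HDFS.
triples.write.mode("overwrite").parquet("hdfs:///rdf/triples.parquet")

# Query the compressed data with Spark SQL -- a rough equivalent of the
# SPARQL pattern: SELECT ?s WHERE { ?s dc:creator "Jane Doe" }
spark.read.parquet("hdfs:///rdf/triples.parquet").createOrReplaceTempView("t")
result = spark.sql("""
    SELECT subject
    FROM t
    WHERE predicate = '<http://purl.org/dc/elements/1.1/creator>'
      AND object = '"Jane Doe"'
""")
result.show()
```

Because Parquet is columnar, a query that touches only the predicate and object columns need not read subject data from disk, which is one reason a column-oriented layout suits triple-shaped data.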

Highlights

  • More and more organizations, communities, and research-development centers are using Semantic Web technologies to represent data using the Resource Description Framework (RDF)

  • The article briefly describes the structure of RDF triples, Apache Spark and its features, column-oriented database systems, and Apache Parquet

  • For the experiment, selected SPARQL queries are tested over RDF data stored in Parquet format in the Hadoop Distributed File System (HDFS)


Introduction

More and more organizations, communities, and research-development centers are using Semantic Web technologies to represent data using RDF. Unlike relational tables, where columns are fixed during schema definition and must contain data of the required type, in RDF a resource can have any number of properties and values drawn from any vocabulary. RDF represents resources in the form of subject, predicate, and object, and a subject can have any number of property-value pairs. This way of representing a resource is called knowledge representation, where everything is defined as knowledge in the form of entity-attribute-value (EAV). In RDF, the basic unit of information is a triple T, such that T = {Subject, Predicate, Object}. Such information, when stored on disk, is called a triplestore.
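
As a concrete illustration of the triple model (the URIs and properties below are invented for the example and do not come from the paper), a triple can be represented as a three-field record, and a resource is simply the set of triples sharing a subject:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str

# One resource, any number of property-value pairs; no fixed schema is
# declared up front, unlike a relational table.
book = [
    Triple("<http://example.org/book/1>", "dc:title",   '"Linked Data"'),
    Triple("<http://example.org/book/1>", "dc:creator", '"Jane Doe"'),
    Triple("<http://example.org/book/1>", "dc:date",    '"2015"'),
]

# A triplestore is, conceptually, this collection of triples persisted to disk.
for t in book:
    print(t.subject, t.predicate, t.object)
```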

