Efficient processing of complex XSD using Hive and Spark.

Rosa Navarrete, Diana Martinez-Mosquera, Sergio Luján-Mora

doi:10.7717/peerj-cs.652


Open Access

https://doi.org/10.7717/peerj-cs.652


Abstract

eXtensible Markup Language (XML) files are widely used in industry because of their flexibility in representing numerous kinds of data. Many applications, such as financial records, social networks, and mobile networks, use complex XML schemas with nested types, contents, and/or extension bases on existing complex elements or large real-world files. A great number of these files are generated each day, and this has influenced the development of Big Data tools for parsing and reporting on them, such as Apache Hive and Apache Spark. For these reasons, multiple studies have proposed new techniques and evaluated the processing of XML files with Big Data systems. However, the usual approach in such works involves the simplest XML schemas, even though real data sets are composed of complex schemas. Therefore, to shed light on complex XML schema processing for real-life applications with Big Data tools, we present an approach that combines three techniques for parsing XML files: cataloging, deserialization, and positional explode. For cataloging, the elements of the XML schema are mapped into root, arrays, structures, values, and attributes. Based on these elements, deserialization and positional explode are straightforward to implement. To demonstrate the validity of our proposal, we develop a case study by implementing a test environment that illustrates the methods using real data sets from the performance management of two mobile network vendors. Our main results confirm the validity of the proposed method for different versions of Apache Hive and Apache Spark, report the query execution times for Apache Hive internal and external tables and Apache Spark data frames, and compare the query performance of Apache Hive with that of Apache Spark. A further contribution is a case study in which a novel solution is proposed for data analysis in the performance management systems of mobile networks.

Highlights

The eXtensible Markup Language (XML) is widely used on the Internet for different purposes
The main difference between our study and previous work is that we evaluate the query execution times for XML documents of complex types from real-life mobile networks in Apache Hive and Apache Spark
Our research explains how to use cataloging, deserialization and positional explode to process complex XML Schema Definition (XSD) in Apache Hive internal and external tables and Apache Spark data frames; we demonstrate the validity of our proposal in a test Big Data environment with real Performance Measurement (PM) XML files from two mobile network vendors
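As a rough illustration of the positional explode idea named in the highlights, the following minimal sketch uses only the Python standard library. It is not the authors' implementation and does not call Apache Hive or Apache Spark; it only mimics what a positional explode does, namely turning an array into one row per element together with that element's position, so that the position can be used to pair each value with its counter name. The XML layout and the element and attribute names (`measData`, `measInfo`, `measTypes`, `measValue`, `results`, `id`, `obj`) are hypothetical and chosen only for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical performance-measurement document (names are assumptions).
SAMPLE = """
<measData>
  <measInfo id="cell-load">
    <measTypes>attempts successes</measTypes>
    <measValue obj="cell-1"><results>10 9</results></measValue>
    <measValue obj="cell-2"><results>7 7</results></measValue>
  </measInfo>
</measData>
"""


def posexplode(values):
    """Return (position, value) pairs, one per array element."""
    return list(enumerate(values))


def flatten(xml_text):
    """Flatten the nested document into one row per measured value.

    Each row is (info_id, object, position, counter_name, value).
    """
    root = ET.fromstring(xml_text.strip())
    rows = []
    for info in root.findall("measInfo"):
        # Array of counter names, shared by every measValue in this block.
        names = info.findtext("measTypes", default="").split()
        for meas_value in info.findall("measValue"):
            results = meas_value.findtext("results", default="").split()
            for pos, val in posexplode(results):
                # The position links a value to its counter name.
                name = names[pos] if pos < len(names) else None
                rows.append(
                    (info.get("id"), meas_value.get("obj"), pos, name, int(val))
                )
    return rows


if __name__ == "__main__":
    for row in flatten(SAMPLE):
        print(row)
```

In this sketch the nested arrays become flat rows that a tabular engine could query directly; in Hive or Spark the same role would be played by the engine's own positional explode facility applied to deserialized array columns.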

Summary

Introduction

The eXtensible Markup Language (XML) is widely used on the Internet for different purposes. Numerous XML-based applications utilize tag-based and nested data structures (Chituc, 2017; Debreceny & Gray, 2001; Hong & Song, 2007) because of their greater flexibility in representing different types of data, which can be customized by the user. The main constraint is that XML representation is inefficient in terms of processing and query times; for this reason, agile and intelligent search and [...] It is important to focus on efficiently processing complex semi-structured data such as XML.

Methods

Results

Conclusion