Abstract

A traditional Data Warehouse is a multidimensional repository of nonvolatile, subject-oriented, integrated, time-variant, and non-operational data gathered from multiple heterogeneous data sources. Traditional Data Warehouse architecture must be adapted to deal with the new challenges imposed by the abundance of data and the characteristics of big data: volume, value, variety, validity, volatility, visualization, variability, and venue. The new architecture also needs to address existing drawbacks, including availability, scalability, and, consequently, query performance. This paper introduces a novel Data Warehouse architecture, named Lake Data Warehouse Architecture, that equips the traditional Data Warehouse to overcome these challenges. Lake Data Warehouse Architecture merges the traditional Data Warehouse architecture with big data technologies, such as the Hadoop framework and Apache Spark, providing a hybrid solution in which the two complement each other. The main advantage of the proposed architecture is that it combines the existing features of traditional Data Warehouses with the big data capabilities gained by integrating the traditional Data Warehouse with the Hadoop and Spark ecosystems. Furthermore, it is tailored to handle a tremendous volume of data while maintaining availability, reliability, and scalability.

Highlights

  • A data warehouse (DW) has many benefits: it enhances Business Intelligence, data quality, and consistency, saves time, and supports historical data analysis and querying [1]

  • The proposed architecture adds features and capabilities that facilitate working with big data technologies and tools (Hadoop, Data Lake, Delta Lake, and Apache Spark) in a complementary way to support and enhance the existing architecture

  • Delta Lake is an extra storage layer that brings reliability to data lakes built on the Hadoop Distributed File System (HDFS) and cloud storage [31]


Summary

INTRODUCTION

A data warehouse (DW) has many benefits: it enhances Business Intelligence, data quality, and consistency, saves time, and supports historical data analysis and querying [1]. In the age of big data, with the massive increase in data volume and types, there is a great need for more adequate architectures and technologies to deal with it. We propose a new DW architecture called Lake Data Warehouse Architecture, a hybrid system that preserves the traditional DW features while adding features and capabilities that facilitate working with big data technologies and tools (Hadoop, Data Lake, Delta Lake, and Apache Spark) in a complementary way to support and enhance the existing architecture. Our proposed contribution solves several issues that arise when integrating data from big data repositories, such as combining the traditional DW technique, the Hadoop framework, and Apache Spark.
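To make the Spark/Delta Lake side of this hybrid concrete, the sketch below shows one way to configure a PySpark session with Delta Lake support and to land data as a Parquet-based lake table that is then promoted to a Delta table. This is an illustrative configuration sketch, not code from the paper: the paths (`/lake/raw/sales`, `/lake/delta/sales`), the `sales` schema, and the application name are assumptions, and running it requires a Spark installation with the `delta-spark` package on the classpath.

```python
# Sketch (assumed setup): SparkSession configured for Delta Lake, writing a
# Parquet data-lake table and promoting it to a Delta table for reliability.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lake-dw-sketch")  # hypothetical application name
    # Standard Delta Lake integration settings (delta-spark must be installed)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Land raw records as Parquet files in an illustrative data-lake zone.
sales = spark.createDataFrame(
    [(1, "2024-01-01", 99.5), (2, "2024-01-02", 42.0)],
    ["order_id", "order_date", "amount"],
)
sales.write.mode("overwrite").parquet("/lake/raw/sales")

# Promote the Parquet data to a Delta table: the Delta transaction log is
# what adds ACID guarantees on top of the plain file-based lake.
spark.read.parquet("/lake/raw/sales").write.format("delta") \
    .mode("overwrite").save("/lake/delta/sales")
```

The two-step write mirrors the complementary design described above: Parquet files in HDFS or cloud storage serve as the inexpensive lake layer, while Delta Lake supplies the transactional storage layer over them.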

BACKGROUND
Hadoop Framework and Data Lake
Apache Spark and Delta Lake
RELATED WORKS
THE PROPOSED LAKE DATA WAREHOUSE ARCHITECTURE
Delta Lake architecture with Apache Spark Cloud Environment
Configure Apache Spark
Create a Parquet-based Data Lake Table
Exploring analysis results
CONCLUSION
