Abstract

Apache Spark enables fast computation and greatly accelerates analytics applications by using main memory efficiently and caching data for later use. At its core, Apache Spark uses data structures called RDDs (Resilient Distributed Datasets) to give a unified view of distributed data. However, the data represented in RDDs remains unencrypted, which can result in leakage of confidential data produced or processed by applications. Apache Spark persists unencrypted RDDs to disk storage under various circumstances, including caching, RDD checkpointing, and data spill during shuffle operations. This lack of security makes Apache Spark unsuitable for processing sensitive information that must be secured at all times. Moreover, RDDs stored in main memory are prone to main-memory attacks such as RAM scraping. In this paper, we propose and develop solutions to close these security gaps in the current Apache Spark framework. We present three different approaches to incorporating security into the Apache Spark framework. These approaches are designed to limit the exposure of unencrypted data during data processing, caching, and data spill to disk. We use a combination of cryptographic splitting and encryption to secure data stored and spilled by Apache Spark, both on disk and in main memory. Our approaches provide strong security by combining the Information Dispersal Algorithm (IDA) with Shamir's Perfect Secret Sharing (PSS). Extensive experimentation shows that, with appropriately chosen parameters, our security approaches provide high security at a performance penalty of 10% to 25%.
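
To make the secret-sharing component concrete, the following is a minimal sketch of textbook Shamir (k, n) threshold sharing over a prime field. It is not the authors' implementation, which the abstract does not describe: the names `PRIME`, `split_secret`, and `recover_secret` are hypothetical, and the choice of field modulus and of sharing a single integer (for example, an encryption key) are assumptions made for illustration.

```python
import secrets

# Assumption: a 127-bit Mersenne prime as the field modulus; any prime
# larger than the secret and the number of shares would work.
PRIME = 2**127 - 1


def split_secret(secret, n, k):
    """Split an integer secret into n shares; any k of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must lie in [0, PRIME)")
    if not 1 <= k <= n < PRIME:
        raise ValueError("need 1 <= k <= n < PRIME")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        # Evaluate the polynomial at x with Horner's rule, modulo PRIME.
        y = 0
        for c in reversed(coeffs):
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For example, `split_secret(key, 5, 3)` yields five shares, and passing any three of them to `recover_secret` returns `key`; fewer than three shares reveal nothing about it. Each share is as large as the secret itself, which is one reason a scheme of this kind is commonly paired with a space-efficient dispersal algorithm such as IDA for bulk data.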

