Abstract

Big data processing environments such as Apache Spark are widely deployed for applications with large-scale workloads. New storage technologies such as Non-Volatile Memory Express Solid State Drives (NVMe SSDs) provide higher throughput than traditional Hard Disk Drives (HDDs), and are therefore rapidly replacing HDDs in modern data centers. In this paper, we explore whether it is critically necessary to use NVMe SSDs for a large workload running on the Spark big data framework. Specifically, we investigate which factors of application design and the Spark data processing framework determine how well the benefits of NVMe SSDs are exploited. Our experimental results reveal that some applications, even with large workloads, cannot fully utilize NVMe SSDs to obtain high I/O throughput. Interestingly, we find that characteristics of the Spark data processing framework such as shuffling (i.e., the volume of intermediate data generated by an application) and parallelism (i.e., the number of concurrently running tasks) have a crucial impact on the performance of big data applications running on NVMe SSDs.
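
To make the two highlighted factors concrete, the sketch below shows a minimal Spark job in Scala. The object name, local master, storage path, and parameter values are hypothetical and chosen only for illustration; the point is where shuffle volume and task parallelism enter a Spark program. Shuffle files are written under spark.local.dir, which is where an NVMe SSD would sit in the I/O path.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (hypothetical values throughout): shows the two factors the
// paper highlights, shuffle volume and task parallelism, in a Spark program.
object ShuffleParallelismSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-parallelism-sketch")
      .master("local[*]") // single-node master, for a self-contained example
      // Shuffle files are written under spark.local.dir; pointing it at an
      // NVMe SSD mount (hypothetical path) puts the device on the I/O path.
      .config("spark.local.dir", "/mnt/nvme/spark-tmp")
      // Parallelism: the default number of concurrently running tasks per stage.
      .config("spark.default.parallelism", "64")
      .getOrCreate()

    val counts = spark.sparkContext
      .parallelize(1 to 1000000, numSlices = 64) // 64 partitions = 64 tasks
      .map(x => (x % 100, 1L))
      // reduceByKey introduces a shuffle boundary: intermediate data is
      // materialized on local disk before the next stage reads it back.
      .reduceByKey(_ + _)
    counts.count() // action that triggers the job, including the shuffle

    spark.stop()
  }
}
```

An application that shuffles little data or runs few concurrent tasks exercises the storage device only lightly, regardless of how fast that device is.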
