Big Data Processing on Single Board Computer Clusters: Exploring Challenges and Possibilities

Eunseo Lee, Dongchul Park, Hyunju Oh

doi:10.1109/access.2021.3120660

Open Access

https://doi.org/10.1109/access.2021.3120660

Abstract

For more than a decade, “big data” has been an industry and academia buzz phrase. Over this time, many companies adopted the Apache Hadoop and Spark frameworks for their massive data storage and analysis efforts, using powerful, energy-hungry, general-purpose servers as their big data processing platforms. But not all industry or academic fields want, or even need, such large systems. Moreover, capital costs aside, power consumption has also become a primary data center concern. Consequently, lower-cost, lower-power microservers have emerged as viable alternatives in many settings. Now, the latest generation Raspberry Pi (RPi), model 4B, exhibits significant computational performance improvements over its predecessors, and is presently considered a sufficiently powerful single board computer (SBC) to run many mainstream operating systems and accommodate heavy workloads. This paper reexamines the possibilities of big data processing on SBC clusters by integrating the most powerful RPi model presently available, the RPi 4B with 4 gigabytes (GB) of main memory. We examine how external storage affects such an SBC cluster’s big data processing performance by employing three different external storage solutions with measurably distinct performance characteristics. Moreover, we discuss the challenges we encountered and identify further SBC cluster performance optimizations. We run several representative big data application benchmarks and measure key performance metrics such as execution time, power consumption, throughput, and performance-per-dollar. Our extensive experiments and comprehensive studies conclude that the current, fourth-generation RPi is the first generation able to effectively run massive (i.e., more than 100 GB) workloads in big data processing applications.

Highlights

Widespread high-speed Internet has accelerated data production, driving demand for efficient big data processing platforms in a “big data era”
Both the Apache Hadoop and Spark platforms provide the foundation for this big data revolution
We explore the challenges and possibilities of the latest generation Raspberry Pi (RPi) for cluster-based big data processing

Summary

Introduction

Widespread high-speed Internet has accelerated data production, driving demand for efficient big data processing platforms in a “big data era”. Big data has become a buzzword, and advances in big data technology have had a crucial impact on our daily lives as well as on numerous industries [1]. Both the Apache Hadoop and Spark platforms provide the foundation for this big data revolution. Industries have built powerful servers and clusters to best exploit these big data software platforms, and these machines have served as industry’s general-purpose workhorses for big data processing and analysis (Figure 1). However, such servers are costly and power-hungry, and not all industries and academic fields need or can afford them. The Raspberry Pi 4 Model B (4B) was released in June 2019 and exhibits a tremendous performance improvement over the previous model due to its full-chip redesign, the first in Raspberry Pi history: more powerful processing cores, the first graphics processor upgrade, and vastly improved memory and external hardware bandwidth, including the first USB 3.0 ports, full-speed Gigabit Ethernet, micro HDMI ports for 4K displays, and up to 4 GB of RAM [21], [22].
