BIG DATA PROCESSING WITH APACHE SPARK

Binh Duc Nguyen, Linh Thi Thuy Nguyen, Oanh Thi Thu Nguyen, Quy Quang Tran

doi:10.35382/tvujs.13.6.2023.2099

Open Access

https://doi.org/10.35382/tvujs.13.6.2023.2099

Abstract

With the exponential growth of information, it is no surprise that the present period of history is known as the Information Age. The rapid growth of data poses challenges for storage and processing technology. This article examines Apache Spark, an ecosystem that integrates many technologies for Big Data processing, including machine learning libraries and data storage platforms. Apache Spark is an open-source engine for distributed data processing: it loads data in memory and runs analytical operations on data of any size, with efficient support for popular programming languages such as Java, Scala, R, and Python. The article aims to show Spark's superior computing power compared with Hadoop and to explain how to connect Spark with popular data processing tools such as the R language.

Full Text