Abstract

In recent years, the volume of data to be processed has grown so large that a single processor can no longer handle it, so scalable cluster-based processing must be developed. MapReduce, Storm, and similar frameworks are one class of solutions for processing data on clusters. However, problems such as parallel processing and resource sharing among users remain to be addressed, and the RDD data model is key to solving them. In this paper, RDD as realized in Spark is examined in detail, and project experiments are carried out to study the RDD data model. The programming language used in this article is Python. A major advantage of this approach to distributed iterative computing is fault recovery: if a task fails during computation, it can be recovered, and by default a task is allowed up to four failures before the job is aborted. A failed stage can be recovered, and a failed partition can likewise be recomputed. Multiple partitions also improve parallelism and efficiency. After a shuffle, however, recovering a failed partition is more expensive, because the many-to-one (wide) dependency means multiple parent partitions must be recalculated.
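The recovery behavior described above can be illustrated with a toy model (plain Python, not the Spark API; all names here are hypothetical): each partition is a deterministic function of the source data and a recorded list of narrow, one-to-one transformations, so a single lost partition can be recomputed from its lineage without touching the others.

```python
# Toy sketch of lineage-based partition recovery (hypothetical names,
# not the Spark API). Each partition is derived deterministically from
# the source data plus a recorded chain of narrow transformations.

def source_partition(i, data, n_parts):
    """Deterministically derive partition i from the source data."""
    return data[i::n_parts]

def compute_partition(i, data, n_parts, transforms):
    """Recompute partition i from its lineage: source slice + transforms."""
    part = source_partition(i, data, n_parts)
    for fn in transforms:
        part = [fn(x) for x in part]
    return part

data = list(range(10))
transforms = [lambda x: x * 2, lambda x: x + 1]  # narrow (one-to-one) deps
n_parts = 4

# Compute all partitions (conceptually in parallel on a cluster).
partitions = [compute_partition(i, data, n_parts, transforms)
              for i in range(n_parts)]

# Simulate the loss of partition 2: only that partition is recomputed
# from its lineage; the other three are untouched.
partitions[2] = None
partitions[2] = compute_partition(2, data, n_parts, transforms)
```

With a wide (shuffle) dependency, `compute_partition` would instead need output from every parent partition, which is why post-shuffle recovery is costlier.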
