Abstract

In recent years, the volume of data to be processed has grown so large that a single processor can no longer handle it, so scalable cluster-based processing must be developed. MapReduce, Storm, and similar frameworks are one class of solutions for processing data on clusters. However, problems such as parallel processing and resource sharing among users remain to be addressed, and the RDD data model is key to solving them. In this paper, RDD as realized in Spark is examined in detail, and project experiments are carried out to study the RDD data model. The programming language used in this article is Python. A major advantage of this approach to distributed iterative computing is fault recovery: if a task fails during computation, it can be recovered, and by default a task is allowed up to four failures before the job is aborted. A failed stage can be recovered, and a failed partition can likewise be recomputed. Multiple partitions also improve parallelism and efficiency. After a shuffle, however, recovering a failed partition is more expensive, because the many-to-one (wide) dependency means multiple parent partitions must be recalculated.
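The recovery behavior described above can be illustrated with a toy model (plain Python, not the Spark API; all names here are hypothetical): each partition is a deterministic function of the source data and a recorded list of narrow, one-to-one transformations, so a single lost partition can be recomputed from its lineage without touching the others.

```python
# Toy sketch of lineage-based partition recovery (hypothetical names,
# not the Spark API). Each partition is derived deterministically from
# the source data plus a recorded chain of narrow transformations.

def source_partition(i, data, n_parts):
    """Deterministically derive partition i from the source data."""
    return data[i::n_parts]

def compute_partition(i, data, n_parts, transforms):
    """Recompute partition i from its lineage: source slice + transforms."""
    part = source_partition(i, data, n_parts)
    for fn in transforms:
        part = [fn(x) for x in part]
    return part

data = list(range(10))
transforms = [lambda x: x * 2, lambda x: x + 1]  # narrow (one-to-one) deps
n_parts = 4

# Compute all partitions (conceptually in parallel on a cluster).
partitions = [compute_partition(i, data, n_parts, transforms)
              for i in range(n_parts)]

# Simulate the loss of partition 2: only that partition is recomputed
# from its lineage; the other three are untouched.
partitions[2] = None
partitions[2] = compute_partition(2, data, n_parts, transforms)
```

With a wide (shuffle) dependency, `compute_partition` would instead need output from every parent partition, which is why post-shuffle recovery is costlier.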
