Abstract

In recent years, the volume of data to be processed has grown so large that a single processor can no longer handle it, which motivates scaling out to cluster mode. Frameworks such as MapReduce and Storm offer cluster-based solutions for data processing, but problems such as parallel processing and the sharing of user resources remain to be addressed. The RDD (Resilient Distributed Dataset) data model is key to solving these problems. In this paper, the RDD model is applied within Spark, elaborated in detail, and studied through project experiments; the programming language used is Python. A key advantage of distributed iterative computation is that failures during the calculation are recoverable: a failed task can be retried (by default, up to four failures are tolerated), a failed stage can be recovered, and a failed partition can be recomputed. Using multiple partitions improves parallelism and efficiency. After a shuffle, however, recovering a failed partition requires recomputation across all parent partitions, because the shuffle introduces a many-to-one (wide) dependency.
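The recovery behavior described above rests on RDD lineage: a derived partition records how it was computed from its parents, so a lost partition can be rebuilt without restarting the whole job. The following is a minimal conceptual sketch in pure Python, not Spark's actual implementation; the class and method names (`ToyRDD`, `compute_partition`) are illustrative only. It models a narrow (one-to-one) dependency, where recomputing one lost partition touches exactly one parent partition.

```python
# Toy model of RDD lineage-based recovery (illustrative, not Spark's API).
# A ToyRDD either holds materialized partitions (a source RDD) or records
# its parent plus the function that derives each partition (its lineage).

class ToyRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions  # list of lists, or None if derived
        self.parent = parent           # lineage: parent RDD
        self.fn = fn                   # lineage: per-partition transform

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Split the source data into fixed-size partitions.
        size = -(-len(data) // num_partitions)  # ceiling division
        parts = [data[i * size:(i + 1) * size] for i in range(num_partitions)]
        return cls(partitions=parts)

    def map(self, f):
        # Narrow dependency: child partition i depends only on parent
        # partition i, so no data is materialized here -- only lineage.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def num_partitions(self):
        rdd = self
        while rdd._partitions is None:
            rdd = rdd.parent
        return len(rdd._partitions)

    def compute_partition(self, i):
        # Recompute a single partition from lineage. For a narrow
        # dependency this walks back to exactly one source partition,
        # which is why recovery of one lost partition is cheap.
        if self._partitions is not None:
            return self._partitions[i]
        return self.fn(self.parent.compute_partition(i))

    def collect(self):
        return [x for i in range(self.num_partitions())
                  for x in self.compute_partition(i)]

base = ToyRDD.parallelize([1, 2, 3, 4, 5, 6], num_partitions=3)
doubled = base.map(lambda x: x * 2)
print(doubled.collect())              # full result across all partitions
print(doubled.compute_partition(1))   # "recover" only the lost partition 1
```

After a shuffle the dependency becomes wide (many-to-one), so `compute_partition(i)` would need data from every parent partition, which is exactly the costlier recomputation case the abstract mentions. In real Spark, the per-task retry limit the abstract refers to corresponds to the `spark.task.maxFailures` configuration, whose default is 4.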
