Abstract

With the dramatic increase in Internet users and their demand for real-time network performance, the Spark distributed computing framework has emerged and is widely used for its high-performance caching mechanism and high scalability. Because data access patterns in today's big data environments are unpredictable, the data shuffling phase is prone to under-utilization of Spark cluster resources, high computational latency, and long task processing times. To address this, this paper proposes an intermediate data management strategy for the data shuffling phase. First, the size of the data generated in the shuffling phase of the Spark platform is predicted by random sampling. Next, a skew-degree division strategy measures the degree of data skew to identify the partitions whose skew exceeds an acceptable deviation. Finally, an adaptive data management strategy assigns the corresponding computation tasks according to the measured skew. In addition, to improve the response time, memory usage, and computation latency of Spark applications, an adaptive cache replacement algorithm based on RDD partition weights is proposed; it computes each RDD partition's weight from four factors: computation cost, number of uses, partition size, and life cycle. Compared with current mainstream baseline algorithms, the proposed shuffling-phase data management algorithm effectively reduces resource usage and computational response latency, and the proposed RDD-partition-weight-based adaptive cache replacement algorithm makes full use of memory resources and effectively reduces resource wastage.
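The abstract does not give the weight formula, but the four factors it names (computation cost, number of uses, partition size, life cycle) suggest a shape like the following minimal Python sketch. The `Partition` fields, the `weight` combination, and the coefficients `a`, `b`, `c` are all illustrative assumptions, not the paper's actual definition:

```python
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    compute_cost: float  # cost to recompute this RDD partition if evicted
    use_count: int       # how many times the partition is referenced
    size_mb: float       # size the partition occupies in cache memory
    lifetime: int        # remaining stages in which the partition is still needed

def weight(p: Partition, a: float = 1.0, b: float = 1.0, c: float = 1.0) -> float:
    # Illustrative combination (coefficients a, b, c are assumed):
    # expensive-to-recompute, frequently reused, still-live partitions get
    # high weights; large partitions are penalized so smaller, cheaper-to-keep
    # partitions are favored under memory pressure.
    return (a * p.compute_cost * p.use_count + c * p.lifetime) / (b * p.size_mb)

def choose_victim(cached: list[Partition]) -> Partition:
    # An adaptive replacement policy evicts the lowest-weight partition first.
    return min(cached, key=weight)
```

Under this sketch, a large, rarely reused partition that is cheap to recompute would be evicted before a small, hot, expensive one, which matches the goal the abstract states: keeping memory occupied by the partitions that are most costly to lose.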
