Research on Data Storage and Processing Optimization Based on Federation HDFS and Spark

He Xu, Wenkang Xie, Fangzhou Chen, Peng Li

doi:10.1007/978-3-319-93659-8_97

Abstract

Hadoop and Spark provide undifferentiated services for data storage and processing, so they cannot meet the on-demand service requirements of different users or different types of data. To address this, this paper proposes a system architecture for data storage and processing optimization based on Federation HDFS and Spark. Incoming data from different users or of different types is classified using the Naive Bayes algorithm. The classified data is stored in Federation HDFS under different backup policies, while Spark processes it according to its priority. In this way, differentiated service is realized and service quality is improved. Experimental results show that the architecture provides different storage strategies and processing priorities for data of different priority levels, and that it offers high fault tolerance and reduced processing delay for high-priority data.
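The classify-then-route idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the feature set, the priority classes, or the concrete backup policies, so everything below is an assumption made for illustration: the feature tokens (`user:vip`, `type:log`, and so on), the two priority labels, the `POLICY` table with its replication factors and pool names, and the helper `plan`. The classifier is a small multinomial Naive Bayes with Laplace smoothing written in plain Python, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical policy table: priority label -> storage and processing settings.
# The replication factors and pool names are illustrative assumptions,
# not values taken from the paper.
POLICY = {
    "high": {"replication": 3, "pool": "high_priority"},
    "low": {"replication": 1, "pool": "low_priority"},
}


class NaiveBayes:
    """Multinomial Naive Bayes over categorical tokens, with Laplace smoothing."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)        # samples seen per class
        self.token_counts = defaultdict(Counter)   # token frequencies per class
        self.vocab = set()
        for tokens, label in zip(samples, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def plan(model, tokens):
    """Classify one record and look up its (hypothetical) storage/processing policy."""
    label = model.predict(tokens)
    return {"priority": label, **POLICY[label]}


# Toy training data: each record is described by made-up user and type tokens.
train_x = [
    ["user:vip", "type:transaction"],
    ["user:vip", "type:log"],
    ["user:guest", "type:log"],
    ["user:guest", "type:cache"],
]
train_y = ["high", "high", "low", "low"]

model = NaiveBayes().fit(train_x, train_y)
result = plan(model, ["user:vip", "type:transaction"])
print(result)
```

In a real deployment the resulting plan would still have to be applied to the cluster. One plausible route, which the abstract does not confirm, is to set the per-file replication factor in HDFS and to submit Spark jobs to separate fair-scheduler pools according to the predicted priority.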

Full Text