Abstract

With the rapid development of technologies such as the Internet, sensors, and bioinformatics, data has grown explosively. In the big data era, more and more iterative algorithms are applied in data mining and machine learning. In most situations, an iterative algorithm runs on an entire dataset that is merged from partial datasets. Given the iterative results on the partial datasets, it is efficient if the result on the entire dataset can be merged from them; otherwise, recomputing on the entire dataset is time-consuming. Unfortunately, current iteration models do not support result merging. In this paper we propose the merge iteration computing model (Mim). Mim is a solution rather than a platform: it states how to execute an iterative algorithm efficiently by reusing existing results without sacrificing accuracy, and the mechanism suits most iterative algorithms. We explain the four steps of Mim: the in-partition iteration step, the error evaluation step, the optional compensation step, and the merge iteration step. The in-partition iteration step is a prerequisite of merge iteration and must be completed before the partial datasets are merged. We also analyze the accuracy and performance advantages of Mim theoretically. As an application scenario, we implement Mim on the Spark framework and apply it to financial data analysis in a city. Finally, through a series of experiments, we demonstrate the efficiency and accuracy of Mim on the PageRank and K-means algorithms. Across the test cases, the maximum optimization ratio of Mim relative to regular iteration is 25% on PageRank and 56% on K-means, and the errors are negligible.
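The idea of reusing partial results can be illustrated with a minimal Python sketch on PageRank. This is not the Mim implementation: the abstract does not specify the error evaluation or compensation steps, so both are omitted here, and the sketch runs on a single machine rather than on Spark. The warm-start rule (summing the partial ranks and renormalizing) and the names `pagerank`, `merge_graphs`, and `warm_start` are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[str]]   # node -> list of out-neighbours
Ranks = Dict[str, float]


def pagerank(graph: Graph, init: Optional[Ranks] = None, damping: float = 0.85,
             tol: float = 1e-10, max_iter: int = 1000) -> Tuple[Ranks, int]:
    """Power-iteration PageRank. Returns (ranks, iterations used)."""
    nodes = sorted(set(graph) | {v for outs in graph.values() for v in outs})
    n = len(nodes)
    if init is None:
        rank = {u: 1.0 / n for u in nodes}            # cold start: uniform
    else:
        # Warm start: reuse earlier ranks; unseen nodes get the uniform value.
        rank = {u: init.get(u, 1.0 / n) for u in nodes}
        total = sum(rank.values())
        rank = {u: r / total for u, r in rank.items()}
    for iteration in range(1, max_iter + 1):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            for v in outs:
                new[v] += damping * rank[u] / len(outs)
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            return rank, iteration
    return rank, max_iter


def merge_graphs(a: Graph, b: Graph) -> Graph:
    """Union of two partial graphs, without duplicate edges."""
    merged: Graph = {}
    for part in (a, b):
        for u, outs in part.items():
            merged.setdefault(u, [])
            for v in outs:
                if v not in merged[u]:
                    merged[u].append(v)
    return merged


def warm_start(*partials: Ranks) -> Ranks:
    """Assumed merge rule: sum the partial ranks (pagerank renormalizes)."""
    init: Ranks = {}
    for ranks in partials:
        for u, r in ranks.items():
            init[u] = init.get(u, 0.0) + r
    return init


if __name__ == "__main__":
    part_a = {"a": ["b"], "b": ["c"], "c": ["a"]}
    part_b = {"c": ["d"], "d": ["e", "a"], "e": []}

    # In-partition iteration: run on each partial dataset before merging.
    ranks_a, _ = pagerank(part_a)
    ranks_b, _ = pagerank(part_b)

    # Merge iteration: iterate on the merged dataset from the partial results.
    merged = merge_graphs(part_a, part_b)
    warm, warm_iters = pagerank(merged, init=warm_start(ranks_a, ranks_b))
    cold, cold_iters = pagerank(merged)   # regular iteration, for comparison
    print("warm-start iterations:", warm_iters)
    print("cold-start iterations:", cold_iters)
    print("max rank difference:", max(abs(warm[u] - cold[u]) for u in cold))
```

Both runs converge to the same ranks up to the tolerance, because damped PageRank has a unique fixed point; the starting vector only affects how many iterations are needed. Whether the warm start actually saves iterations depends on how close the partial results are to the merged solution, which is presumably what an error evaluation step would assess.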
