Abstract

Data-intensive applications are widespread; examples include massive data mining, search engines, and high-throughput computing in bioinformatics. As data volumes keep growing, data processing becomes a bottleneck. In a traditional relational database, the cost of processing large-scale datasets increases dramatically, because the traditional approach tends to rely on high-performance computers. The rise of cloud computing offers a new solution for data processing thanks to its easy scalability, robustness, large-scale storage, and high performance, and it provides a cost-effective platform for implementing distributed parallel data processing algorithms. In this paper, we propose CPLDP (Cloud based Parallel Large Data Processing System), an innovative MapReduce-based parallel data processing system developed to meet the urgent requirements of large-scale data processing. In CPLDP, we propose a new method called operation dependency analysis, which models the data processing workflow and, where possible, reorders and combines operations. This optimization reduces the reading and writing of intermediate files. Performance tests show that optimizing the processing workflow reduces both processing time and the volume of intermediate results.
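
The abstract does not spell out how operation dependency analysis works, so the following is only a minimal sketch of the general idea, not CPLDP's actual implementation. It assumes that each operation declares the fields it reads and writes, that two operations conflict when one writes a field the other reads or writes, that a filter may be moved ahead of any operation it does not conflict with, and that record-wise operations can be combined into a single pass so that no intermediate file is written between them. All names (`Op`, `depends_on`, `push_filters_early`, `fuse`) and the sample records are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass
class Op:
    """A record-wise operation (hypothetical model, not CPLDP's API)."""
    name: str
    reads: FrozenSet[str]    # fields the operation reads
    writes: FrozenSet[str]   # fields the operation writes
    fn: Callable[[Dict], List[Dict]]  # one record in, zero or more out
    is_filter: bool = False


def depends_on(later: Op, earlier: Op) -> bool:
    """True if `later` must stay after `earlier`:
    read-after-write, write-after-read, or write-after-write."""
    return bool(
        (later.reads & earlier.writes)
        or (later.writes & earlier.reads)
        or (later.writes & earlier.writes)
    )


def push_filters_early(ops: List[Op]) -> List[Op]:
    """Bubble each filter ahead of preceding non-filter operations
    that it does not depend on, so fewer records reach later steps."""
    ops = list(ops)
    for i in range(len(ops)):
        j = i
        while (j > 0 and ops[j].is_filter and not ops[j - 1].is_filter
               and not depends_on(ops[j], ops[j - 1])):
            ops[j - 1], ops[j] = ops[j], ops[j - 1]
            j -= 1
    return ops


def fuse(ops: List[Op]) -> Callable[[Dict], List[Dict]]:
    """Combine record-wise operations into one function, so the whole
    chain runs in a single pass without intermediate files."""
    def run(record: Dict) -> List[Dict]:
        records = [record]
        for op in ops:
            records = [out for r in records for out in op.fn(r)]
        return records
    return run


def parse(r):
    user, size = r["raw"].split()
    return [{**r, "user": user, "bytes": int(size)}]


def to_kb(r):
    return [{**r, "kb": r["bytes"] / 1024}]


def keep_alice(r):
    return [r] if r["user"] == "alice" else []


parse_op = Op("parse", frozenset({"raw"}), frozenset({"user", "bytes"}), parse)
to_kb_op = Op("to_kb", frozenset({"bytes"}), frozenset({"kb"}), to_kb)
keep_alice_op = Op("keep_alice", frozenset({"user"}), frozenset(), keep_alice,
                   is_filter=True)

# The filter reads only "user", so it can move ahead of to_kb but not parse.
optimized = push_filters_early([parse_op, to_kb_op, keep_alice_op])
pipeline = fuse(optimized)

data = [{"raw": "alice 2048"}, {"raw": "bob 1024"}]
result = [out for record in data for out in pipeline(record)]
```

In this sketch the three steps, which would otherwise exchange data through two intermediate files if run as separate jobs, execute as one pass, and the reordered filter discards non-matching records before the unit conversion runs.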
