Efficient finer-grained incremental processing with MapReduce for big data

Liang Zhang, Yuanyuan Feng, Peiyi Shen, Guangming Zhu, Wei Wei, Juan Song, Syed Afaq Ali Shah, Mohammed Bennamoun

doi:10.1016/j.future.2017.09.079

Abstract

With the continuous development of the Internet and information technology, a growing number of mobile terminals, wearable devices, and similar equipment contribute to a tremendous volume of data. Thanks to distributed computing, this big data can be analyzed at high speed. However, many kinds of big data share an obvious characteristic: the datasets grow incrementally over time, which means that distributed computing should focus on incremental processing. A number of systems for incremental data processing are available, such as Google’s Percolator and Yahoo’s CBP. However, to use these mature frameworks, developers must make troublesome changes to their programs to meet the frameworks’ requirements. In this paper, we introduce a MapReduce framework, named HadInc, for efficient incremental computations. HadInc is designed for offline scenarios, in which real-time processing is unnecessary and in-memory cluster computing is not applicable. HadInc takes advantage of finer-grained computing and Content-defined Chunking (CDC) to ensure that the system can still reuse previously computed results, even if the split data has changed substantially. Instead of re-computing the changed data entirely, HadInc quickly finds the difference between the new split and the old one, and then merges the delta with the old results to produce the latest result for the new dataset. The stability of the dataset’s division is a key factor in reusing results. To guarantee this stability, we propose a series of novel algorithms based on CDC. We implemented HadInc by extending the Hadoop framework and evaluated it with many experiments, including three specific cases and a practical case. The comparison results show that the proposed HadInc is very efficient.
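
The abstract does not spell out HadInc's chunking or merging algorithms, so the following is only a minimal sketch of the general idea it describes: cut points chosen from the content itself, per-chunk results cached by chunk digest, and a merge of cached and freshly computed partial results. It is not the authors' implementation. The names `cdc_chunks` and `incremental_count`, the constants `WINDOW` and `DIVISOR`, the use of a CRC-32 over a sliding window in place of a true rolling hash, and the byte-histogram "map" function are all assumptions made for illustration.

```python
import hashlib
import zlib
from collections import Counter

WINDOW = 16    # bytes of trailing context that decide a cut point (assumed value)
DIVISOR = 64   # a cut is declared roughly once every DIVISOR bytes (assumed value)


def cdc_chunks(data: bytes) -> list:
    """Split data wherever the checksum of the trailing WINDOW bytes matches a pattern.

    Because a cut depends only on nearby content, an insertion or deletion
    disturbs the chunks around the edit and leaves the others byte-identical.
    """
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])  # trailing remainder
    return chunks


def incremental_count(data: bytes, cache: dict) -> tuple:
    """Return (byte histogram of data, number of chunks that had to be recomputed).

    `cache` maps a chunk digest to that chunk's partial result and is
    shared across runs so unchanged chunks are never processed twice.
    """
    total, recomputed = Counter(), 0
    for chunk in cdc_chunks(data):
        key = hashlib.sha1(chunk).hexdigest()
        if key not in cache:
            cache[key] = Counter(chunk)  # hypothetical per-chunk "map" work
            recomputed += 1
        total += cache[key]              # merge old and new partial results
    return total, recomputed
```

Calling `incremental_count` once on the original data and again on an edited copy with the same `cache` should recompute only the few chunks near the edit, while the merged histogram matches a full recomputation. With fixed-size splits, by contrast, a single inserted byte would shift every later split boundary and invalidate all downstream cached results.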

Journal: Future Generation Computer Systems
Publication Date: Oct 20, 2017
Citations: 10
