Abstract

Edge computing has emerged as a computing paradigm and has grown explosively in recent years. We consider the problem of pushing data deduplication to the network edge and propose a new framework for distributed edge-facilitated deduplication (EF-dedup). Deduplication at the network edge lets us exploit the high degree of geographic and temporal correlation in edge data to achieve space efficiency. By collaboratively leveraging the distributed computing power available at the edge, the edge nodes can effectively suppress duplicated edge data, consuming considerably less storage space and WAN bandwidth. To this end, we partition the edge nodes into disjoint collaborative clusters, maintain a deduplication index structure across the nodes of each cluster using a distributed key-value store, and perform deduplication within those clusters. This partitioning problem is challenging because it requires optimizing a novel trade-off: edge nodes with highly correlated data may not always be within the same edge cloud, and communication among them incurs non-trivial network cost. We formulate a joint storage and network optimization problem with different design objectives, such as arbitrary partitioning and balanced partitioning of edge nodes. The problem is shown to be NP-hard in general. We then develop an optimization framework with efficient algorithms and prove that it achieves a closed-form competitive ratio.
Our experiments, performed on edge nodes in a corporate lab and a central cloud at AWS, demonstrate that EF-dedup achieves 67.4% to 133.7% better deduplication throughput than purely cloud-based techniques and 20.0% to 62.6% lower aggregate cost in terms of the network-storage trade-off than approaches that favor only one of the two.
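
The storage-network trade-off described above can be illustrated with a small sketch. This is not the paper's formulation or algorithm: the fixed-size chunking, the SHA-256 fingerprints, the in-memory set standing in for the distributed key-value index, the pairwise `latency` table, the weight `alpha`, and all function names are assumptions made purely for illustration.

```python
import hashlib
from itertools import combinations


def chunk_fingerprints(data, chunk_size=4):
    """Split data into fixed-size chunks and fingerprint each with SHA-256."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def unique_chunks(cluster, node_data, chunk_size=4):
    """Count chunks stored after deduplicating against one cluster-wide index.

    The set below is a stand-in for a distributed key-value store
    shared by the nodes of a single cluster (an assumption of this sketch).
    """
    index = set()
    for node in cluster:
        index.update(chunk_fingerprints(node_data[node], chunk_size))
    return len(index)


def aggregate_cost(partition, node_data, latency, alpha=1.0, chunk_size=4):
    """Toy cost: stored chunks plus alpha times intra-cluster pairwise latency."""
    storage = sum(unique_chunks(c, node_data, chunk_size) for c in partition)
    network = sum(
        latency[frozenset(pair)]
        for c in partition
        for pair in combinations(sorted(c), 2)
    )
    return storage + alpha * network


# Hypothetical data: nodes "a" and "b" share two chunks and are close together;
# node "c" holds unrelated data and is far from both.
node_data = {
    "a": b"AAAABBBBCCCC",
    "b": b"AAAABBBBDDDD",
    "c": b"EEEEFFFFGGGG",
}
latency = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 5.0,
    frozenset(("b", "c")): 5.0,
}

no_sharing = [{"a"}, {"b"}, {"c"}]   # dedup per node only
correlated = [{"a", "b"}, {"c"}]     # cluster the correlated, nearby nodes
one_cluster = [{"a", "b", "c"}]      # dedup across everything

for name, partition in [("no_sharing", no_sharing),
                        ("correlated", correlated),
                        ("one_cluster", one_cluster)]:
    print(name, aggregate_cost(partition, node_data, latency))
```

In this toy instance, clustering only the correlated, nearby nodes gives the lowest cost (8.0), compared with 9.0 for no sharing and 18.0 for a single global cluster. The single cluster stores the fewest chunks but pays for every long-distance link, which is the tension the partitioning problem has to balance.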

