Disk Bottleneck Research Articles

In deduplication, the index-lookup disk bottleneck is a major obstacle that limits the throughput of backup processes. One way to reduce its effect and boost speed is to deduplicate with very coarse-grained chunks, at the cost of lower storage savings and limited scalability. Another way is to distribute the deduplication process among multiple nodes, but this approach introduces the storage-node island effect and incurs high communication cost. In this paper, we explore dCACH, a content-aware clustered and hierarchical deduplication system. It implements a hybrid of inline coarse-grained and offline fine-grained distributed deduplication in which routing decisions are made for a set of files rather than for single files. It uses Bloom filters to detect similarity between a data stream and previous data streams and performs stateful routing, which solves the storage-node island problem. Moreover, it exploits the negligibly small amount of content shared among chunks from different file types to create groups of files and deduplicates each group in its own fingerprint index space. It implements hierarchical deduplication to reduce the size of fingerprint indexes at the global level, where only files and large segments are deduplicated. Locality is created and exploited in two ways: first through the large segments deduplicated at the global level, and second by routing a set of consecutive files together to one storage node. Furthermore, using Bloom filters for similarity detection between streams has low communication and computation cost while achieving duplicate elimination performance comparable to single-node deduplication. dCACH is evaluated using a prototype deployed in a server environment distributed over four separate machines. It is shown to have 10× the speed of Extreme_Binn with minimal communication overhead, while its duplicate elimination effectiveness is on a par with a single-node deduplication system.
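
The Bloom-filter-based stateful routing described above can be illustrated with a minimal sketch. The abstract does not give dCACH's actual data structures or routing rule, so everything here is an assumption for illustration: the `BloomFilter` class, the `route_stream` function, the filter size, the use of SHA-1 segment fingerprints, and the tie-break on node load are all hypothetical. The idea shown is only that each storage node keeps a compact filter of the fingerprints it has seen, and an incoming stream is sent as a whole to the node whose filter matches the most of its fingerprints.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step avoids a zero stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def fingerprint(segment: bytes) -> bytes:
    """Fingerprint of one coarse-grained segment (SHA-1 is an assumption)."""
    return hashlib.sha1(segment).digest()


def route_stream(segments, node_filters, node_loads):
    """Route a whole stream to the node whose filter matches it best.

    Ties are broken in favour of the least-loaded node (hypothetical rule).
    The chosen node's filter and load are updated, which makes routing stateful.
    """
    fps = [fingerprint(s) for s in segments]
    hits = [sum(fp in bf for fp in fps) for bf in node_filters]
    best = max(range(len(node_filters)), key=lambda n: (hits[n], -node_loads[n]))
    for fp in fps:
        node_filters[best].add(fp)
    node_loads[best] += len(fps)
    return best
```

In this sketch only the small per-node filters need to be consulted to make a routing decision, not the full fingerprint indexes, which is one way to read the abstract's claim of low communication and computation cost. A stream that shares segments with earlier data lands on the node that already holds them, while unrelated streams spread across nodes by load.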

Although Markov Chain Monte Carlo (MCMC) is very popular for parameter inference, reducing its computational burden is crucial given the limits of processors and memory and the disk bottleneck. This is especially true when handling big data. In recent years, researchers have developed parallel MCMC algorithms in which the full data are partitioned into subdatasets. Samples are drawn from the subdatasets independently on different machines without communication. In the extant literature, all machines are deemed to be identical. However, due to the heterogeneity of the data assigned to different machines and the random nature of MCMC, the assumption of “identical machines” is questionable. Here we propose a Powered Embarrassing Parallel MCMC (PEPMCMC) algorithm, in which the full-data posterior density is the product of the sub-posterior densities (the posterior densities of the different subdatasets) raised to constrained powers. This is proven to be equivalent to a weighted averaging procedure. In our work, the powers are determined by a maximum likelihood criterion, which leads to finding a maximum likelihood point within the convex hull of the estimates from the different machines. We prove the asymptotic exactness of the algorithm and apply it to several cases to verify its strength in comparison with the non-parallel and unpowered parallel algorithms. Furthermore, the connection between normal kernel density estimation and parametric density estimation under certain conditions is investigated.
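
The link between powered sub-posteriors and weighted averaging can be seen in a small special case. The sketch below assumes each sub-posterior is a scalar Gaussian, which is an illustrative simplification and not a statement of the paper's method; the function name `combine_powered_gaussians` is hypothetical, and the powers are taken as given inputs rather than being chosen by the maximum likelihood criterion the abstract mentions. Under that assumption, the product of N(mu_i, var_i) densities raised to powers w_i is again Gaussian, with a mean equal to a weighted average of the mu_i.

```python
def combine_powered_gaussians(means, variances, powers):
    """Mean and variance of prod_i N(mu_i, var_i)^{w_i} for scalar Gaussians.

    Raising N(mu_i, var_i) to the power w_i scales its precision to
    w_i / var_i, and multiplying Gaussian densities adds precisions.
    The combined mean is therefore a precision-weighted average.
    """
    precisions = [w / v for w, v in zip(powers, variances)]
    total = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total
    return mean, 1.0 / total


# Two machines with estimates 0.0 and 2.0 and equal variance.
equal = combine_powered_gaussians([0.0, 2.0], [1.0, 1.0], [1.0, 1.0])
# Giving the first machine a larger power pulls the result towards it.
tilted = combine_powered_gaussians([0.0, 2.0], [1.0, 1.0], [3.0, 1.0])
```

Because the weights are non-negative and sum to one after normalisation, the combined mean always lies between the smallest and largest machine estimates, which matches the abstract's description of a point within the convex hull of the estimates.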

Articles published on Disk Bottleneck

DCACH: Content Aware Clustered and Hierarchical Distributed Deduplication

Powered embarrassing parallel MCMC sampling in Bayesian inference, a weighted average intuition

Managed acceleration for In-Memory database analytic workloads

IMPLEMENTATION AND EVALUATION OF RUNTIME DATA DECLUSTERING METHOD OVER SAN-CONNECTED PC CLUSTER

GMBlock: Optimizing data movement in a block-level storage sharing system over Myrinet

Scalable high performance de-duplication backup via hash join
