MixApart: decoupled analytics for shared storage systems

Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza

doi:10.5555/2591272.2591287

Abstract

Distributed file systems built for data analytics and enterprise storage systems have very different functionality requirements. For this reason, enabling analytics on enterprise data commonly introduces a separate analytics storage silo. This generates additional costs and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos.

MixApart uses an integrated data caching and scheduling solution to allow MapReduce computations to analyze data stored on enterprise storage systems. The front-end caching layer enables the local storage performance required by data analytics. The shared storage back-end simplifies data management.

We evaluate MixApart using a 100-core Amazon EC2 cluster with micro-benchmarks and production workload traces. Our evaluation shows that MixApart provides (i) up to 28% faster performance than the traditional ingest-then-compute workflows used in enterprise IT analytics, and (ii) comparable performance to an ideal Hadoop setup without data ingest, at similar cluster sizes.
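
As a rough illustration of what coupling a cache with task scheduling can look like, the sketch below orders tasks so that those whose input is already cached on compute nodes run first, and fetches the remaining inputs from shared storage into the cache. It is a minimal sketch under stated assumptions, not MixApart's actual algorithm: the names `BlockCache` and `schedule`, the LRU eviction policy, and the one-block-per-task model are all hypothetical and do not come from the abstract.

```python
from collections import OrderedDict


class BlockCache:
    """Hypothetical LRU cache of data blocks held on compute-node disks."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> True, oldest entry first

    def __contains__(self, block_id):
        return block_id in self.blocks

    def touch(self, block_id):
        """Mark a cached block as most recently used."""
        self.blocks.move_to_end(block_id)

    def admit(self, block_id):
        """Add a block fetched from shared storage, evicting the LRU block if full."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)
        self.blocks[block_id] = True


def schedule(tasks, cache):
    """Run cache hits first, then fetch inputs for the misses.

    tasks: list of (task_id, block_id) pairs (assumed: one input block per task).
    Returns (run_order, fetched_blocks).
    """
    hits = [t for t in tasks if t[1] in cache]
    misses = [t for t in tasks if t[1] not in cache]

    # Refresh hit blocks so fetches for misses evict other blocks first.
    for _, block_id in hits:
        cache.touch(block_id)

    fetched = []
    for _, block_id in misses:
        # A block shared by several tasks is fetched only once.
        if block_id not in cache:
            cache.admit(block_id)
            fetched.append(block_id)

    run_order = [task_id for task_id, _ in hits + misses]
    return run_order, fetched


cache = BlockCache(capacity=2)
cache.admit("a")
order, fetched = schedule([("t1", "b"), ("t2", "a"), ("t3", "b")], cache)
```

In this toy run, `t2` is scheduled first because block `a` is already cached, and block `b` is fetched once even though two tasks read it. A real system would overlap fetches with computation and account for cache pressure when many misses arrive together, which this sketch ignores.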

Full Text