Abstract

Distributed file systems built for data analytics and enterprise storage systems have very different functionality requirements. For this reason, enabling analytics on enterprise data commonly introduces a separate analytics storage silo. This generates additional costs, and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos.MixApart uses an integrated data caching and scheduling solution to allow MapReduce computations to analyze data stored on enterprise storage systems. The front-end caching layer enables the local storage performance required by data analytics. The shared storage back-end simplifies data management.We evaluate MixApart using a 100-core Amazon EC2 cluster with micro-benchmarks and production workload traces. Our evaluation shows that MixApart provides (i) up to 28% faster performance than the traditional ingest-then-compute workflows used in enterprise IT analytics, and (ii) comparable performance to an ideal Hadoop setup without data ingest, at similar cluster sizes.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call