Abstract

Over the past few years, cloud file systems such as Google File System (GFS) and Hadoop Distributed File System (HDFS) have attracted considerable research effort aimed at optimizing their designs and implementations. A common issue for these efforts is performance benchmarking. Unfortunately, many system researchers and engineers struggle to build a benchmark that reflects real-life workloads, owing to the complexity of cloud file systems and the poorly understood characteristics of their I/O workloads. They can easily make incorrect assumptions about their systems and workloads, so that benchmark results diverge from actual behavior. As a preliminary step toward designing a realistic benchmark, we explore the characteristics of data and I/O workload in a production environment. We collected a two-week I/O workload trace from a 2,500-node production cluster, which is one of the largest cloud platforms in Asia. This cloud platform provides two public cloud services: a data storage service (DSS) and a data processing service (DPS). We analyze the commonalities and differences between the two services from multiple perspectives, including request arrival pattern, request size, and data population. We highlight eight key observations from this study, including that the request arrival rate follows a lognormal distribution rather than a Poisson distribution, that request arrivals exhibit multiple periodicities, and that cloud file systems fit a partly-open model rather than a purely open or closed model. Based on the comparative analysis, we derive several implications to guide system researchers and engineers in building realistic benchmarks for their own systems. Finally, we discuss several open issues and challenges in benchmarking cloud file systems.
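The lognormal-versus-Poisson observation can be illustrated with a minimal sketch. It is not the authors' analysis method: the data below is synthetic, and the helper names `dispersion_index` and `fit_lognormal` as well as the parameters 5.0 and 0.8 are assumptions chosen only for illustration. The idea is that Poisson-distributed counts have a variance-to-mean ratio near 1, whereas heavy-tailed lognormal rates are strongly over-dispersed.

```python
import math
import random
import statistics


def dispersion_index(values):
    """Variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    return statistics.pvariance(values) / statistics.fmean(values)


def fit_lognormal(rates):
    """Maximum-likelihood lognormal parameters (mu, sigma) for positive rates."""
    logs = [math.log(r) for r in rates]
    return statistics.fmean(logs), statistics.pstdev(logs)


# Synthetic per-interval arrival rates (an assumption, NOT data from the trace).
rng = random.Random(0)
rates = [rng.lognormvariate(5.0, 0.8) for _ in range(10_000)]

mu, sigma = fit_lognormal(rates)
d = dispersion_index(rates)
print(f"mu={mu:.2f} sigma={sigma:.2f} dispersion={d:.1f}")
```

A dispersion index far above 1 on real per-interval request counts would suggest that a Poisson arrival assumption in a benchmark generator understates burstiness; a proper comparison would also use a goodness-of-fit test on the trace itself.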
