Performance analysis of distributed storage clusters based on kernel and userspace traces

Houssem Daoud,Michel R Dagenais

doi:10.1002/spe.2889

Abstract

SummaryDistributed storage systems are commonly used in modern computing. They are highly scalable and offer data replication and fault tolerance. The complexity of those systems makes them difficult to debug using traditional tools. The existing tools are able to evaluate the overall performance of such systems but they do not provide enough information to find the root cause of performance issues. In this article, we propose a tracing‐based performance analysis framework for storage clusters. We use a tracing strategy that reduces the tracing overhead in production systems. The traces collected from the different storage nodes are correlated and used to generate a data model that represents the cluster. Userspace tracing is used to gather data from the storage daemons, while Kernel tracing is used to provide detailed information about operating system internals such as disk queues, network queues and process scheduling. Efficient data structures are used to store the model and to generate metrics and graphical views. Our tool is used in different real world scenarios and is able to investigate interesting performance problems including I/O latencies, data replication and storage nodes failures.

Full Text