Scalable communication event tracing via clustering

Amir Bahmani, Frank Mueller

doi:10.1016/j.jpdc.2017.06.008


Open Access

https://doi.org/10.1016/j.jpdc.2017.06.008


Abstract

Communication traces help developers of high-performance computing (HPC) applications understand and improve their codes. When applications run on large-scale HPC facilities, the scalability of tracing tools becomes a challenge. To address this problem, traces can be clustered into groups of processes that exhibit similar behavior. Instead of collecting trace information from each individual node, it then suffices to collect a trace from a small set of representative nodes, namely one per cluster. However, clustering algorithms themselves need to have low overhead, be scalable, and adapt to application characteristics. We devised an adaptive clustering algorithm for large-scale applications called ACURDION that traces the MPI communication of code with O(log P) time complexity. First, ACURDION identifies the parameters that differ across processes by using a logarithmic algorithm called Adaptive Signature Building. Second, it clusters the processes based on those parameters. Experiments show that collecting traces of just nine nodes/clusters suffices to capture the communication behavior of all nodes for a wide set of HPC benchmark codes while retaining sufficient accuracy of trace events and parameters. In summary, ACURDION improves trace scalability and automation over prior approaches.
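The idea of grouping processes by a signature and tracing one representative per cluster can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not ACURDION or its Adaptive Signature Building step: the function names `cluster_by_signature` and `representatives`, the tuple-shaped signatures, and the choice of the lowest rank as representative are all hypothetical. Simulated ranks merge their signature maps pairwise up a binary tree, so the number of merge rounds grows as ceil(log2 P).

```python
def cluster_by_signature(signatures):
    """Merge per-rank signatures up a binary tree.

    signatures[i] is a hashable value summarizing the communication
    parameters of simulated rank i (a hypothetical stand-in for a real
    trace signature). Returns (clusters, rounds), where clusters maps
    each signature to the list of ranks sharing it.
    """
    p = len(signatures)
    # Every simulated rank starts with a one-entry map: signature -> [rank].
    local = [{sig: [rank]} for rank, sig in enumerate(signatures)]
    rounds = 0
    step = 1
    while step < p:
        # In each round, rank r absorbs the map held by rank r + step.
        for receiver in range(0, p, 2 * step):
            sender = receiver + step
            if sender < p:
                for sig, ranks in local[sender].items():
                    local[receiver].setdefault(sig, []).extend(ranks)
        step *= 2
        rounds += 1
    # After the last round, rank 0 holds the merged view of all ranks.
    return local[0], rounds


def representatives(clusters):
    """Pick one rank per cluster (here: the lowest rank, an assumption)."""
    return sorted(min(ranks) for ranks in clusters.values())


if __name__ == "__main__":
    # Eight simulated ranks whose (made-up) signatures fall into three groups.
    sigs = [("MPI_Send", rank % 3) for rank in range(8)]
    clusters, rounds = cluster_by_signature(sigs)
    print(rounds, representatives(clusters))
```

In this toy setup only the ranks returned by `representatives` would need to record a full trace. The sketch counts merge rounds only; it does not model message sizes or the adaptive parameter selection described in the abstract.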


Journal: Journal of Parallel and Distributed Computing
Publication Date: Jun 21, 2017
Citations: 5
License type: publisher-specific-oa


