3D DFT by block tensor-matrix multiplication via a modified Cannon's algorithm: Implementation and scaling on distributed-memory clusters with fat tree networks

Nitin Malapally, Viacheslav Bolnykh, Estela Suarez, Paolo Carloni, Thomas Lippert, Davide Mandelli

doi:10.1016/j.jpdc.2024.104945

Abstract

A known scalability bottleneck of the parallel 3D FFT is its use of all-to-all communications. Here, we present S3DFT, a library that circumvents this by using point-to-point communication – albeit at a higher arithmetic complexity. This approach exploits three variants of Cannon's algorithm with adaptations for block tensor-matrix multiplications. We demonstrate S3DFT's efficient use of hardware resources, and its scaling using up to 16,464 cores of the JUWELS Cluster. However, in a comparison with well-established 3D FFT libraries, its parallel efficiency and performance were found to fall behind. A detailed analysis identifies the cause in two of its component algorithms, which scale poorly owing to how their communication patterns are mapped in subsets of the fat tree topology. This result exposes a potential drawback of running block-wise parallel algorithms on systems with fat tree networks caused by increased communication latencies along specific directions of the mesh of processing elements.
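The core idea in the abstract, computing a 3D DFT through tensor-matrix products rather than an FFT, can be illustrated with a minimal serial sketch. It is not S3DFT's implementation and includes no Cannon-style block distribution or communication; the function names (`dft_matrix`, `dft3d_by_matmul`, `dft3d_direct`, `max_abs_diff`) are hypothetical and chosen only for this example. The sketch applies a DFT matrix along each of the three axes in turn and checks the result against the triple-sum definition. For an N×N×N input, each of the three contractions costs on the order of N^4 multiply-adds, which is the higher arithmetic complexity the abstract refers to, compared with O(N^3 log N) for an FFT.

```python
import cmath


def dft_matrix(n):
    """Return the n x n DFT matrix W with W[k][j] = exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)]
            for k in range(n)]


def dft3d_by_matmul(x):
    """3D DFT of a nested list x[i][j][k] via three mode-wise matrix products.

    Assumption: this is a plain serial illustration; a distributed version
    would split each contraction into blocks exchanged between processes.
    """
    n0, n1, n2 = len(x), len(x[0]), len(x[0][0])
    w0, w1, w2 = dft_matrix(n0), dft_matrix(n1), dft_matrix(n2)

    # Contract the last axis with W2: a[i][j][c] = sum_k x[i][j][k] * W2[c][k]
    a = [[[sum(x[i][j][k] * w2[c][k] for k in range(n2))
           for c in range(n2)] for j in range(n1)] for i in range(n0)]

    # Contract the middle axis with W1: b[i][q][c] = sum_j a[i][j][c] * W1[q][j]
    b = [[[sum(a[i][j][c] * w1[q][j] for j in range(n1))
           for c in range(n2)] for q in range(n1)] for i in range(n0)]

    # Contract the first axis with W0: y[p][q][c] = sum_i b[i][q][c] * W0[p][i]
    return [[[sum(b[i][q][c] * w0[p][i] for i in range(n0))
              for c in range(n2)] for q in range(n1)] for p in range(n0)]


def dft3d_direct(x):
    """Reference 3D DFT evaluated directly from the triple-sum definition."""
    n0, n1, n2 = len(x), len(x[0]), len(x[0][0])
    out = [[[0j] * n2 for _ in range(n1)] for _ in range(n0)]
    for p in range(n0):
        for q in range(n1):
            for c in range(n2):
                s = 0j
                for i in range(n0):
                    for j in range(n1):
                        for k in range(n2):
                            phase = p * i / n0 + q * j / n1 + c * k / n2
                            s += x[i][j][k] * cmath.exp(-2j * cmath.pi * phase)
                out[p][q][c] = s
    return out


def max_abs_diff(u, v):
    """Largest elementwise absolute difference between two 3D nested lists."""
    return max(abs(u[i][j][k] - v[i][j][k])
               for i in range(len(u))
               for j in range(len(u[0]))
               for k in range(len(u[0][0])))


if __name__ == "__main__":
    # Small non-cubic test tensor with arbitrary complex entries.
    x = [[[complex(i + 2 * j - k, i * k - j) for k in range(4)]
          for j in range(3)] for i in range(2)]
    print(max_abs_diff(dft3d_by_matmul(x), dft3d_direct(x)))
```

Because the three contractions are ordinary matrix products along one axis at a time, each can in principle be partitioned into blocks and distributed with a Cannon-style shift pattern, which is where point-to-point communication would replace the all-to-all exchange of a parallel FFT.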

Journal: Journal of Parallel and Distributed Computing
Publication Date: Jun 28, 2024
License type: cc-by-nc

