Abstract

This paper proposes a scheme for large-scale stencil computations whose data size exceeds the total main memory of the nodes in a cluster. The scheme uses flash SSDs distributed over the cluster nodes as an extension of main memory, together with a locality-aware algorithm. We propose three algorithms, each with a different hierarchical blocking scheme across three memory tiers (flash SSD, DRAM, and cache), and evaluate them on different platforms and flash devices. The algorithms exploit highly parallel asynchronous input/output on flash SSDs and select appropriate blocking parameters with an auto-tuning system named Blk-Tune. They also overcome the performance degradation caused by non-uniform memory access (NUMA) architectures. The algorithms optimized for a single node are then extended to multiple nodes and evaluated on a cluster with traditional SATA SSDs as well as state-of-the-art flash devices, such as low-power, cost-effective M.2 NVMe flash SSDs. With this scheme and distributed flash devices in a cluster, large-scale stencil problems can be solved with a limited number of nodes and moderately sized main memories.
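The idea of hierarchical blocking can be illustrated with a minimal sketch. It is not one of the paper's three algorithms: it assumes a one-dimensional 3-point averaging stencil with fixed boundaries, uses an in-memory list to stand in for the flash tier, and omits asynchronous I/O, NUMA handling, and auto-tuning. The names `sweep_reference`, `sweep_blocked`, `outer_block`, and `inner_block` are hypothetical.

```python
def sweep_reference(u):
    """One unblocked sweep of a 3-point averaging stencil; end points stay fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def sweep_blocked(u, outer_block, inner_block):
    """The same sweep, organized as two nested block levels.

    outer_block: assumed size of a chunk staged from the slow tier into DRAM
    inner_block: assumed size of a sub-chunk intended to fit in cache
    """
    n = len(u)
    out = list(u)
    for o_start in range(1, n - 1, outer_block):
        o_end = min(o_start + outer_block, n - 1)
        # Stage the outer block plus a one-cell halo on each side.
        # staged[k] holds u[o_start - 1 + k]; slicing a list makes a copy.
        staged = u[o_start - 1:o_end + 1]
        width = o_end - o_start
        for i_start in range(0, width, inner_block):
            i_end = min(i_start + inner_block, width)
            for j in range(i_start, i_end):
                # Neighbours of global cell o_start + j are staged[j] and staged[j + 2].
                out[o_start + j] = (staged[j] + staged[j + 1] + staged[j + 2]) / 3.0
    return out
```

Because every cell reads only the staged block and its halo, the blocked sweep produces the same values as the unblocked one for any positive block sizes. In a real out-of-core setting, choosing the block sizes is the kind of decision that a tuner such as Blk-Tune is described as making.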
