Abstract

We present an implementation of parallel I/O in the Modular Ocean Model (MOM), a numerical ocean model used for climate forecasting, and determine its optimal performance over a range of tuning parameters. Our implementation uses the parallel API of the netCDF library, and we investigate the potential bottlenecks associated with the model configuration, the netCDF implementation, the underlying MPI-IO library and the Lustre filesystem. We investigate the performance of a global 0.25° resolution model using 240 and 960 CPUs. The best performance is observed when we limit the number of contiguous I/O domains on each compute node and assign one MPI rank to aggregate and write the data from each node, while ensuring that all nodes participate in writing this data to our Lustre filesystem. These best-performance configurations are then applied to a higher-resolution 0.1° global model using 720 and 1440 CPUs, where we observe even greater performance improvements. In all cases, the tuned parallel I/O implementation achieves much faster write speeds than serial single-file I/O, up to 60 times faster at the higher resolution. Under the constraints outlined above, we observe that performance scales as the number of compute nodes and I/O aggregators is increased, ensuring the continued scalability of I/O-intensive MOM5 model runs that will be used in our next-generation higher-resolution simulations.
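To make the approach concrete, the sketch below shows what a collective write through the netCDF-4 parallel API looks like in C. This is a minimal illustration rather than the MOM5 code itself: the file name, grid dimensions and variable are hypothetical, and each rank is assumed to own one contiguous band of latitudes.

```c
/* Minimal sketch of a collective parallel write with the netCDF-4
 * parallel API.  The grid sizes (NLON, NLAT), the file name and the
 * variable "eta" are illustrative assumptions, not taken from the
 * model configuration described above. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

#define NLON 1440   /* hypothetical global grid, roughly 0.25 degree */
#define NLAT 1080

static void check(int status, const char *msg)
{
    if (status != NC_NOERR) {
        fprintf(stderr, "%s: %s\n", msg, nc_strerror(status));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Every rank opens the same file; the netCDF library coordinates
       access through MPI-IO underneath. */
    int ncid;
    check(nc_create_par("eta.nc", NC_NETCDF4 | NC_MPIIO,
                        MPI_COMM_WORLD, MPI_INFO_NULL, &ncid),
          "nc_create_par");

    int dimids[2], varid;
    check(nc_def_dim(ncid, "lat", NLAT, &dimids[0]), "def lat");
    check(nc_def_dim(ncid, "lon", NLON, &dimids[1]), "def lon");
    check(nc_def_var(ncid, "eta", NC_DOUBLE, 2, dimids, &varid), "def var");
    check(nc_enddef(ncid), "enddef");

    /* Collective access lets MPI-IO aggregate the writes from many
       ranks into large, contiguous filesystem requests. */
    check(nc_var_par_access(ncid, varid, NC_COLLECTIVE), "par access");

    /* Each rank owns one contiguous band of latitudes (assumes NLAT
       is divisible by the number of ranks, for brevity). */
    size_t rows = NLAT / nranks;
    size_t start[2] = { rank * rows, 0 };
    size_t count[2] = { rows, NLON };

    double *band = malloc(rows * NLON * sizeof(double));
    for (size_t i = 0; i < rows * NLON; i++)
        band[i] = (double)rank;   /* stand-in for model data */

    check(nc_put_vara_double(ncid, varid, start, count, band), "put_vara");

    free(band);
    check(nc_close(ncid), "close");
    MPI_Finalize();
    return 0;
}
```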

Highlights

  • Optimal performance of a computational science model requires efficient numerical methods that are well matched to the computational resources of the high performance computing (HPC) platform

  • This is especially true of highly parallelized models on HPC cluster systems, where the calculations are distributed across many compute nodes, often with strong data dependencies between the individual processes

  • We focus on a parallel I/O implementation for the Modular Ocean Model (MOM), the principal ocean model of the Geophysical Fluid Dynamics Laboratory (GFDL) (Griffies, 2012)



Introduction

Optimal performance of a computational science model requires efficient numerical methods that are well matched to the computational resources of the high performance computing (HPC) platform. A library based on MPI-IO can use MPI message passing to coordinate data transfer across processes, and can reshape data transfers to match the available bandwidth and the number of physical disks provided by a parallel filesystem such as Lustre (Howison et al., 2010). This eliminates the need for writer PEs to allocate large amounts of memory and avoids unnecessary postprocessing of fragmented datasets into single files, while offering the possibility of efficient, scalable I/O performance when writing to a parallel filesystem. Although the stripe size should generally match the data block size of I/O operations (Turner and McIntosh-Smith, 2017), we found that the stripe size had only a limited effect on write performance, and the default 1 MiB gave satisfactory I/O performance in our preselection process.
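In practice, tuning parameters such as the number of aggregators and the Lustre striping reach the MPI-IO layer as hints attached to the file at creation time. The C sketch below illustrates this using the standard ROMIO hint names; the helper function and the specific values (eight aggregators, eight stripes, a 1 MiB stripe unit) are illustrative placeholders to be tuned per system, not our benchmarked settings.

```c
/* Sketch: passing MPI-IO hints through to the Lustre layer when a
 * parallel netCDF file is created.  The hint names are the standard
 * ROMIO ones; the values below are placeholders for tuning. */
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

int create_tuned_file(const char *path, MPI_Comm comm, int *ncid)
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* Force collective buffering on writes and cap the number of
       aggregator ranks (ideally one per compute node). */
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_nodes", "8");

    /* Lustre striping: spread the file over 8 OSTs with the default
       1 MiB stripe size, which the preselection found satisfactory. */
    MPI_Info_set(info, "striping_factor", "8");
    MPI_Info_set(info, "striping_unit", "1048576");

    int status = nc_create_par(path, NC_NETCDF4 | NC_MPIIO,
                               comm, info, ncid);
    MPI_Info_free(&info);
    return status;
}
```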

