Abstract

The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), a multi-scale chemical transport model used for air quality forecasting and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on the second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features, such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 v4 (Broadwell) processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to a hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512-bit-wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory accesses to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve performance and parallel scalability. These optimisations greatly improved GNAQPMS performance: compared with the baseline version, the optimised version was 3.51× faster on KNL and 2.77× faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient in power consumption than the CPU platform. The optimisations also enabled much better parallel scalability on both clusters: the model scaled to 40 CPU nodes and 30 KNL nodes, with parallel efficiencies of 70.4 % and 42.2 %, respectively.
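
The abstract reports three kinds of figures: speedup, parallel efficiency and an energy comparison. As a reading aid, the following minimal sketch shows how such metrics are commonly computed from wall-clock times and average power. The function names, the strong-scaling definition of efficiency and all input values are assumptions made for illustration; they are not taken from the paper and do not reproduce its measurements.

```python
def speedup(t_baseline: float, t_optimised: float) -> float:
    """Ratio of baseline to optimised wall-clock time (> 1 means faster)."""
    return t_baseline / t_optimised


def parallel_efficiency(t_ref: float, n_ref: int, t_n: float, n: int) -> float:
    """Strong-scaling efficiency of a run on n nodes, relative to a
    reference run on n_ref nodes (1.0 would be ideal scaling)."""
    return (t_ref * n_ref) / (t_n * n)


def energy_to_solution(avg_power_w: float, runtime_s: float) -> float:
    """Energy in joules, assuming the average power holds over the whole run."""
    return avg_power_w * runtime_s


# Hypothetical inputs, chosen only to exercise the formulas.
print(f"speedup: {speedup(1000.0, 400.0):.2f}x")
print(f"parallel efficiency: {parallel_efficiency(400.0, 1, 25.0, 20):.1%}")

energy_a = energy_to_solution(300.0, 400.0)  # hypothetical platform A
energy_b = energy_to_solution(400.0, 400.0)  # hypothetical platform B
print(f"energy saving of A relative to B: {1 - energy_a / energy_b:.1%}")
```

Because energy to solution is the product of average power and runtime, a platform can come out ahead through lower power draw, shorter runtime, or both.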

Highlights

  • Insatiable computing demand is driven by the ever-increasing scientific demands in many research codes such as the climate model Community Earth System Model (CESM) and the weather model Weather Research and Forecasting Model (WRF)

  • The global chemistry transport model Global Nested Air Quality Prediction Modeling System (GNAQPMS) was optimised to run on Intel's second-generation MIC architecture, the KNL processor, and to accelerate its key modules

  • The tests of Opt-V GNAQPMS were conducted on the latest Xeon E5-2697 V4 and KNL 7250 clusters

Summary

Introduction

Insatiable computing demand is driven by ever-increasing scientific requirements in many research codes, such as the climate model Community Earth System Model (CESM) and the weather model Weather Research and Forecasting Model (WRF). Mielikainen et al. (2014a, b, c, 2015a, b) carried out a series of studies porting WRF physical schemes to the KNC platform, including the Goddard microphysics scheme, the Thompson microphysics scheme, the Goddard shortwave radiation scheme and the advection scheme in the model dynamic core. Among these, the Goddard microphysics scheme (Tao and Simpson, 1993; Khain et al., 2003) achieved a 4.7× speedup on KNC and a 2.8× speedup on the CPU compared with its baseline version; because the MIC and the CPU share the same modern hardware features, the optimisations led to a speedup on both platforms.

Model and KNL description
Model description of GNAQPMS
KNL description
Baseline performance test
Optimisation technology
Main optimisation methods
Global communication
Emission process section and typical vectorisation
CBM-Z gas-phase chemistry section
Diffusion and wet deposition section
Performance evaluation
Platform setup
Validation of the model results
Speedup performance
Scalability on a cluster
Findings
Conclusions
