Abstract

The China Meteorological Administration (CMA) has used High Performance Computing Systems (HPCS) for over three decades. CMA's HPCS investment provides the reliable HPC capability needed to run Numerical Weather Prediction (NWP) models and climate models, generating millions of weather guidance products daily and supporting the Coupled Model Intercomparison Project Phase 5 (CMIP5). Monitoring the HPCS and analyzing resource usage can improve performance and reliability for our users, which requires a good understanding of failure characteristics; however, large-scale studies of failures in real production systems are scarce. This paper collects and analyzes all failures that occurred during the HPC operation period, focusing in particular on the relationship between the HPCS and NWP applications. We also present the challenges in developing a more effective monitoring system and summarize useful maintenance strategies. These steps may considerably improve online failure prediction for HPC and future system performance.

Highlights

  • The China Meteorological Administration (CMA) High Performance Computing System (HPCS) is managed by the National Meteorological Information Center (NMIC) at CMA in Beijing, China

  • The CMA HPCS serves all users on the CMA campus and in the provincial bureaus, including users from the National Climate Center, the Numerical Weather Prediction Center, and other operational centers

  • We analyze a 44-month dataset of CMA HPCS workload traces, covering all jobs submitted from Nov. 2013 to Jun. 2017, and investigate users' waiting patterns
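The waiting-pattern analysis in the highlights can be illustrated with a minimal sketch: computing per-job waiting time (scheduler start time minus submission time) and aggregating it per user. The record layout and field names below are illustrative assumptions, not the actual CMA HPCS trace schema.

```python
from datetime import datetime

# Hypothetical workload-trace records (field names are assumptions,
# not the real CMA HPCS trace schema).
jobs = [
    {"user": "u1", "submit": "2013-11-01 08:00:00", "start": "2013-11-01 08:05:00"},
    {"user": "u2", "submit": "2013-11-01 09:00:00", "start": "2013-11-01 09:30:00"},
    {"user": "u1", "submit": "2013-11-02 10:00:00", "start": "2013-11-02 10:01:00"},
]

FMT = "%Y-%m-%d %H:%M:%S"

def wait_seconds(job):
    """Waiting time = scheduler start time minus submission time."""
    submit = datetime.strptime(job["submit"], FMT)
    start = datetime.strptime(job["start"], FMT)
    return (start - submit).total_seconds()

def mean_wait_by_user(jobs):
    """Average waiting time in seconds, keyed by user."""
    totals = {}
    for job in jobs:
        total, count = totals.get(job["user"], (0.0, 0))
        totals[job["user"]] = (total + wait_seconds(job), count + 1)
    return {user: total / count for user, (total, count) in totals.items()}

print(mean_wait_by_user(jobs))  # → {'u1': 180.0, 'u2': 1800.0}
```

The same aggregation could be grouped by month or by queue instead of by user to expose temporal waiting patterns.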


Summary

High Performance Computing System in CMA

The CMA HPCS is managed by the National Meteorological Information Center (NMIC) at CMA in Beijing, China. It comprises two national subsystems and seven regional subsystems located in different provincial meteorological bureaus. The supercomputer is an IBM Flex P460 system, consisting of 1,786 compute nodes with 57,152 compute cores and 5.4 petabytes of storage. There are 560 compute nodes with 17,920 cores, 77 terabytes of memory, and 1,730 terabytes of storage. Each compute node contains 4 Power processors (3.3 GHz). The compute clusters and storage clusters are connected by InfiniBand at 160 Gb/s in each direction. The CMA HPCS serves all users on the CMA campus and in the provincial bureaus, including users from the National Climate Center, the Numerical Weather Prediction Center, and other operational centers.

CMA HPCS Performance
CMA HPC Operation Monitoring System
Challenges
Solutions and Maintenance Strategies
Findings
Concluding Remarks
