High Performance Computing Power Research Articles

Abstract. The data volume produced by regional and global multicomponent Earth system models is rapidly increasing because of the improved spatial and temporal resolution of the model components and the growing sophistication of the numerical models in terms of the physical processes they represent and their complex non-linear interactions. In particular, very small time steps need to be defined in non-hydrostatic high-resolution modeling applications to represent the evolution of fast-moving processes such as turbulence, extratropical cyclones, convective lines, jet streams, internal waves, vertical turbulent mixing and surface gravity waves. Consequently, these small time steps cause extra computation and disk input–output overhead in the modeling system, even on today's most powerful high-performance computing and data storage systems. Efficient integrated analysis of the high volume of data from multiple Earth system model components at different temporal and spatial resolutions is also challenging when relying on today's traditional postprocessing methods. This study mainly aims to explore the feasibility and added value of integrating existing in situ visualization and data analysis methods within the model coupling framework. The objective is to increase interoperability between Earth system multicomponent code and data-processing systems by providing an easy-to-use, efficient, generic and standardized modeling environment. The new data analysis approach enables simultaneous analysis of the vast amount of data produced by multicomponent regional Earth system models at runtime. The presented methodology also aims to create an integrated modeling environment for analyzing fast-moving processes and their evolution in both time and space to support a better understanding of the underlying physical mechanisms. 
The state-of-the-art approach can also be employed to solve common problems in the model development cycle, e.g., designing a new subgrid-scale parameterization that requires inspecting the integrated model behavior at a higher temporal and spatial scale simultaneously and supporting visual debugging of the multicomponent modeling systems, which usually are not facilitated by existing model coupling libraries and modeling systems.

The solution of large eigenproblems is involved in many scientific and engineering applications, for instance when stability analysis is a concern. For large simulations in material physics or thermo-acoustics, the calculation can last for many hours on large parallel platforms. On future large-scale systems, the mean time between failures (MTBF) of the system is expected to decrease so that many faults could occur during the solution of large eigenproblems. Consequently, it becomes critical to design parallel eigensolvers that can survive faults. In this context, we investigate the relevance of approaches relying on numerical techniques, which might be combined with more classical techniques for real large-scale parallel implementations. Because we focus on numerical remedies, we do not consider parallel implementations or parallel experiments, only numerical experiments. We assume that a separate mechanism ensures fault detection and that a system layer provides support for setting the environment (processes, ...) back into a running state. Once the system is in a running state after a fault, our main objective is to provide robust resilient schemes so that the eigensolver may keep converging in the presence of the fault without restarting the calculation from scratch. For this purpose, we extend to eigenproblems the interpolation-restart (IR) strategies initially introduced in a previous work for the solution of linear systems. For a given numerical scheme, the IR strategies consist of extracting relevant spectral information from the data still available after a fault. After data extraction, a well-selected part of the missing data is regenerated through interpolation strategies to constitute a meaningful input to restart the numerical algorithm. One of the main features of this numerical remedy is that it does not require extra resources, i.e., computational units or computing time, when no fault occurs. 
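The interpolation-restart idea described above can be illustrated with a small sketch. The abstract does not state the interpolation formula, so everything here is an assumption made for illustration: the solver is plain power iteration (not one of the eigensolvers studied in the paper), exactly one vector entry is lost, the current eigenvalue estimate `lam` is assumed to survive the fault, and the lost entry is regenerated from the corresponding row of A x = lam x. The matrix `A` and the helper names are hypothetical.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def normalize(x):
    """Scale x to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def rayleigh(A, x):
    """Rayleigh quotient, used as the eigenvalue estimate."""
    Ax = matvec(A, x)
    return sum(a * b for a, b in zip(x, Ax)) / sum(v * v for v in x)

def power_iteration(A, x, steps):
    """Plain power iteration; returns the iterate and its eigenvalue estimate."""
    for _ in range(steps):
        x = normalize(matvec(A, x))
    return x, rayleigh(A, x)

def interpolate_lost_entry(A, x, lam, lost):
    """Regenerate x[lost] from row `lost` of A x = lam x (illustrative assumption).

    Only the surviving entries of x are read; x[lost] is ignored.
    """
    off_diag = sum(A[lost][j] * x[j] for j in range(len(x)) if j != lost)
    repaired = list(x)
    repaired[lost] = off_diag / (lam - A[lost][lost])
    return repaired

# Hypothetical small symmetric test matrix.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]

# Run the solver for a while before the simulated fault.
x, lam = power_iteration(A, [1.0, 1.0, 1.0], steps=30)

# Simulated fault: entry 1 of the iterate is lost.
lost = 1
damaged = list(x)
damaged[lost] = float("nan")

# Interpolate the lost entry from surviving data, then restart the iteration.
repaired = interpolate_lost_entry(A, damaged, lam, lost)
x2, lam2 = power_iteration(A, normalize(repaired), steps=5)
print(lam2)
```

Because the restart begins from a vector that is already close to the pre-fault iterate, the iteration continues converging rather than starting over, and no extra work is done in the fault-free path.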
In this paper, we revisit a few state-of-the-art methods for solving large sparse eigenvalue problems, namely the Arnoldi methods, subspace iteration methods and the Jacobi-Davidson method, in the light of our IR strategies. For each considered eigensolver, we adapt the IR strategies to regenerate as much spectral information as possible. Through extensive numerical experiments, we study the respective robustness of the resulting resilient schemes with respect to the MTBF and to the amount of data loss via qualitative and quantitative illustrations.

1. Introduction. The computation of eigenpairs (eigenvalues and eigenvectors) of large sparse matrices is involved in many scientific and engineering applications, such as when stability analysis is a concern. To name a few, it appears in structural dynamics, thermodynamics, thermo-acoustics and quantum chemistry. With the permanent increase of the computational power of high performance computing (HPC) systems through the use of a larger and larger number of CPU cores or specialized processing units, HPC applications are increasingly prone to faults. To guarantee fault tolerance, two classes of strategies are required: one for fault detection and the other for fault correction. Faults such as computational node crashes are obvious to detect, while silent faults may be challenging to detect. To cope with silent faults, a duplication strategy is commonly used for fault detection [18, 39] by comparing the outputs, while triple modular redundancy (TMR) is used for fault detection and correction [34, 37]. However, the additional computational resources required by such replication strategies may represent a severe penalty. Instead of replicating computational resources, studies [7, 36] propose a time redundancy model for fault detection. It consists of repeating the computation twice on the same resource. 
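The two redundancy schemes mentioned above can be sketched in a few lines. This is a minimal illustration, not the mechanism of the cited studies: `run_twice` repeats a computation and compares the outputs (detection only), while `run_tmr` runs it three times and takes a majority vote (detection and correction). The function names, the `FaultDetected` exception and the simulated bit flip in `make_flaky` are all hypothetical.

```python
from collections import Counter

class FaultDetected(Exception):
    """Raised when redundant runs disagree and no majority exists."""

def run_twice(compute):
    """Time redundancy: repeat the computation and compare the two outputs."""
    first = compute()
    second = compute()
    if first != second:
        raise FaultDetected("outputs differ: %r vs %r" % (first, second))
    return first

def run_tmr(compute):
    """Triple modular redundancy: three runs, majority vote corrects one fault."""
    results = [compute() for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise FaultDetected("no majority among %r" % (results,))
    return value

def make_flaky(fault_on_call):
    """Build a computation whose result is corrupted on one chosen call.

    fault_on_call=0 never triggers, because calls are counted from 1.
    """
    calls = {"n": 0}
    def compute():
        calls["n"] += 1
        result = sum(i * i for i in range(10))
        if calls["n"] == fault_on_call:
            result ^= 1 << 3  # simulated silent bit flip
        return result
    return compute

# TMR outvotes the single corrupted run.
print(run_tmr(make_flaky(2)))

# Duplication only detects the disagreement.
try:
    run_twice(make_flaky(2))
except FaultDetected as exc:
    print("detected:", exc)
```

The sketch also shows the cost trade-off the text describes: duplication doubles the work and can only signal a fault, whereas TMR triples it in exchange for correcting a single faulty run.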
The advantage of time redundancy models is their flexibility at the application level: software developers can select only a set of critical instructions to protect. Recomputing only some instructions instead of the whole application lowers the time redundancy overhead [25]. In some numerical simulations, data naturally satisfy well-defined mathematical properties. These properties can be efficiently exploited for fault detection through a periodic check of the numerical properties during computation [10]. Checkpoint/restart is the most studied fault recovery strategy in the context of HPC systems. The common checkpoint/restart scheme consists of periodically saving data onto a reliable storage device such as a remote disk. When a fault occurs, a rollback is performed to the most recent consistent checkpoint. According to the implemented checkpoint strategy, all processes

Articles published on High Performance Computing Power

An efficient heterogeneous parallel algorithm of the 3D MOC for multizone heterogeneous systems

Dynamic GPU power capping with online performance tracing for energy efficient GPU computing using DEPO tool

SwMPAS-A: Scaling MPAS-A to 39 Million Heterogeneous Cores on the New Generation Sunway Supercomputer

High-Performance Statistical Computing in the Computing Environments of the 2020s.

The effect of appendages on ship resistance

Artificial Neural Network‐Based Method for Seismic Analysis of Concrete‐Filled Steel Tube Arch Bridges

Interface capturing simulations of bubble population effects in PWR subchannels

Goodbye, motherboard. Bare chiplets bonded to silicon will make computers smaller and more powerful: Hello, silicon-interconnect fabric

Heterogeneous parallel algorithm design and performance optimization for WENO on the Sunway Taihulight supercomputer

Toward modular in situ visualization in Earth system models: the regional modeling system RegESM 1.1

Parallel clustering of single cell transcriptomic data with split-merge sampling on Dirichlet process mixtures.

Research on Distributed File System Framework Based on Solid State Drive

Aiding Cascading Analysis Modelling with High-performance-computing Technology

Roadmap and research issues of multiagent social simulation using high-performance computing

WatCache: a workload-aware temporary cache on the compute side of HPC systems

Coupled hydro-meteorological modelling on a HPC platform for high-resolution extreme weather impact study

Predicting permeability tensors of foams using vector kinetic method

BEAM: A Computational Workflow System for Managing and Modeling Material Characterization Data in HPC Environments

Interpolation-Restart Strategies for Resilient Eigensolvers

Leveraging the Power of High Performance Computing for Next Generation Sequencing Data Analysis: Tricks and Twists from a High Throughput Exome Workflow
