Message Passing Interface Research Articles

Phylogenies are central to many research areas in biology and are commonly estimated using likelihood-based methods. Unfortunately, any likelihood-based method, including Bayesian inference, can be prohibitively slow for large datasets (with many taxa and/or many sites in the sequence alignment) or complex substitution models. The primary limiting factor when using large datasets and/or complex models in probabilistic phylogenetic analyses is the likelihood calculation, which dominates the total computation time. To address this bottleneck, we incorporated the high-performance phylogenetic library BEAGLE into RevBayes, which enables multi-threading on multi-core CPUs and GPUs, as well as hardware-specific vectorized instructions for faster likelihood calculations. Our new implementation of RevBayes+BEAGLE retains the flexibility and dynamic nature that users expect from vanilla RevBayes. In addition, we implemented native parallelization within RevBayes, without an external library, using the message passing interface (MPI); we refer to this as RevBayes+MPI. We evaluated our new implementation of RevBayes+BEAGLE using multi-threading on CPUs and two powerful GPUs (NVIDIA Titan V and NVIDIA A100) against our native implementation of RevBayes+MPI. We found substantial speedups when multiple cores were used, with up to 20-fold speedup when using multiple CPU cores and over 90-fold speedup when using multiple GPU cores. The improvement depended on the data type used (DNA or amino acids) and the size of the alignment, but less on the size of the tree. We additionally investigated the cost of rescaling partial likelihoods to avoid numerical underflow and showed that unnecessarily frequent and inefficient rescaling can increase runtimes up to 4-fold. Finally, we presented and compared a new approach that stores partial likelihoods on branches instead of nodes, which can speed up computations by up to 1.7 times but doubles the memory requirements.
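The rescaling idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the scheme used by BEAGLE or RevBayes; the function names and the threshold value are assumptions chosen for illustration. The sketch multiplies many small positive factors, as happens when partial likelihoods are combined, and only rescales when the running value drops below a threshold, accumulating the removed scale in log space.

```python
import math

def naive_product(factors):
    """Multiply factors directly; underflows to 0.0 for long products."""
    value = 1.0
    for f in factors:
        value *= f
    return value

def log_product_with_rescaling(factors, threshold=1e-100):
    """Return log(prod(factors)) for positive factors.

    Hypothetical helper: the running value is reset to 1.0 whenever it
    falls below `threshold`, and its log is added to `log_scale`.
    Rescaling only when needed keeps the number of log() calls low.
    """
    value = 1.0
    log_scale = 0.0
    for f in factors:
        value *= f
        if value < threshold:
            log_scale += math.log(value)
            value = 1.0
    return math.log(value) + log_scale

# Illustrative input (not data from the article): 2000 factors of 1e-3.
factors = [1e-3] * 2000
underflowed = naive_product(factors)          # 0.0 in double precision
log_likelihood = log_product_with_rescaling(factors)
```

Rescaling after every multiplication would give the same result but would call `math.log` 2000 times instead of about 60, which mirrors the point that unnecessarily frequent rescaling adds runtime.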

Abstract. We discuss the various performance aspects of parallelizing our transient global-scale groundwater model at 30′′ resolution (30 arcsec; ∼ 1 km at the Equator) on large distributed memory parallel clusters. This model, referred to as GLOBGM, is the successor of our 5′ (5 arcmin; ∼ 10 km at the Equator) PCR-GLOBWB 2 (PCRaster Global Water Balance model) groundwater model, which is based on MODFLOW and has two model layers. The current version of GLOBGM (v1.0) used in this study also has two model layers, is uncalibrated, and uses available 30′′ PCR-GLOBWB data. Increasing the model resolution from 5′ to 30′′ creates challenges, including increased runtime, memory usage, and data storage that exceed the capacity of a single computer. We show that our parallelization tackles these problems with relatively low parallel hardware requirements to meet the needs of users or modelers who do not have exclusive access to hundreds or thousands of nodes within a supercomputer. For our simulation, we use unstructured grids and a prototype version of MODFLOW 6 that we have parallelized using the message-passing interface. We construct independent unstructured grids with a total of 278 million active cells to exclude all redundant sea and land cells, while satisfying all necessary boundary conditions, and distribute them over three continental-scale groundwater models (Afro–Eurasia: 168 million; the Americas: 77 million; Australia: 16 million) and one remaining model for the smaller islands (17 million). Each of the four groundwater models is partitioned into multiple non-overlapping submodels that are tightly coupled within the MODFLOW linear solver, where each submodel is uniquely assigned to one processor core, and associated submodel data are written in parallel during pre-processing, using data tiles. For balancing the parallel workload in advance, we apply the widely used METIS graph partitioner in two ways: it is applied directly to all (lateral) model grid cells, and it is applied in an area-based manner to HydroBASINS catchments that are assigned to submodels, as a pre-sorting step for a future coupling with surface water. We consider an experiment for simulating the years 1958–2015 with daily time steps and monthly input, including a 20-year spin-up, on the Dutch national supercomputer Snellius. Given that the serial simulation would require ∼ 4.5 months of runtime, we set a hypothetical target of a maximum of 16 h of simulation runtime. We show that 12 nodes (32 cores per node; 384 cores in total) are sufficient to achieve this target, resulting in a speedup of 138 for the largest Afro–Eurasia model when using 7 nodes (224 cores) in parallel. A limited evaluation of the model output using the United States Geological Survey (USGS) National Water Information System (NWIS) head observations for the contiguous United States was conducted. This showed that increasing the resolution from 5′ to 30′′ results in a significant improvement with GLOBGM for the steady-state simulation when compared to the 5′ PCR-GLOBWB groundwater model. However, results for the transient simulation are quite similar, and there is much room for improvement. Monthly and multi-year total terrestrial water storage anomalies derived from the GLOBGM and PCR-GLOBWB models, however, compared favorably with observations from the GRACE satellite. For the next versions of GLOBGM, further improvements require a more detailed (hydro)geological schematization and better information on the locations, depths, and pumping rates of abstraction wells.
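As a small worked example of the figures quoted in this abstract, the reported speedup of 138 on 224 cores can be turned into a parallel efficiency using the standard definition (speedup divided by core count). The helper name below is an assumption for illustration; only the speedup, node count, and cores per node come from the abstract.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup achieved (speedup / cores)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores

# Figures from the abstract: speedup of 138 for the Afro-Eurasia model
# on 7 nodes with 32 cores per node (224 cores).
afro_eurasia_eff = parallel_efficiency(138, 7 * 32)
print(f"{afro_eurasia_eff:.2f}")  # prints 0.62
```

Under this definition, the Afro–Eurasia run achieves roughly 62% of ideal linear scaling.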

Articles published on Message Passing Interface

Bayesian Phylogenetic Analysis on Multi-Core Compute Architectures: Implementation and Evaluation of BEAGLE in RevBayes With MPI.

Structure-exploiting interior-point solver for high-dimensional entropy-sparsified regression learning

GLOBGM v1.0: a parallel implementation of a 30 arcsec PCR-GLOBWB-MODFLOW global-scale groundwater model

Parallel computation to bidimensional heat equation using MPI/CUDA and FFTW package

Test data generation for covering mutation-based path using MGA for MPI program

A GPU-ready pseudo-spectral method for direct numerical simulations of multiphase turbulence

Predicting Software Defects in Hybrid MPI and OpenMP Parallel Programs Using Machine Learning

A scalable parallel computing method for autonomous platoons

Program partitioning and deadlock analysis for MPI based on logical clocks

Speed Up of Volumetric Non-Local Transform-Domain Filter Utilising HPC Architecture.

Stochastic Gradient Descent for matrix completion: Hybrid parallelization on shared- and distributed-memory systems

Gaussian approximation potentials: Theory, software implementation and application examples.

An Architecture for a Tri-Programming Model-Based Parallel Hybrid Testing Tool

Finding Bottlenecks in Message Passing Interface Programs by Scalable Critical Path Analysis

An efficient watermarking algorithm for digital audio data in security applications

A massive MPI parallel framework of smoothed particle hydrodynamics with optimized memory management for extreme mechanics problems

Two different parallel approaches for a hybrid fractional order Coronavirus model

Scalable simulation of coupled adsorption and transport of methane in confined complex porous media with density preconditioning

Numerical investigation of coaxial turbulent jet

Comparing the Performance of Julia on CPUs versus GPUs and Julia-MPI versus Fortran-MPI: a case study with MPAS-Ocean (Version 7.1)
