We document plain Fortran and Fortran MPI checkerboard code for Markov chain Monte Carlo simulations of pure SU(3) lattice gauge theory with the Wilson action in D dimensions. The Fortran code uses periodic boundary conditions and is suitable for pedagogical purposes and small-scale simulations. For the Fortran MPI code two geometries are covered: the usual torus with periodic boundary conditions and the double-layered torus as defined in the paper. Parallel computing is performed on checkerboards of sublattices, which partition the full lattice in one, two, and so on, up to D directions, depending on the parameters set. For updating, the Cabibbo–Marinari heatbath algorithm is used. We present validations and test runs of the code. Performance is reported for a number of currently used Fortran compilers and, when applicable, MPI versions. For the parallelized code, performance is studied as a function of the number of processors.

Program summary

Program title: STMC2LSU3MPI
Catalogue identifier: AEMJ_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEMJ_v1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 26666
No. of bytes in distributed program, including test data, etc.: 233126
Distribution format: tar.gz
Programming language: Fortran 77, compatible with the use of Fortran 90/95 compilers, in part with MPI extensions.
Computer: Any capable of compiling and executing Fortran 77 or Fortran 90/95, when needed with MPI extensions.
Operating system:
1. Red Hat Enterprise Linux Server 6.1 with OpenMPI + pgf77 11.8-0,
2. Centos 5.3 with OpenMPI + gfortran 4.1.2,
3. Cray XT4 with MPICH2 + pgf90 11.2-0.
Has the code been vectorised or parallelized?: Yes, parallelized using MPI extensions.
Number of processors used: 2 to 11664
RAM: 200 Megabytes per process.
Classification: 11.5.
Nature of problem: Physics of pure SU(3) Quantum Field Theory (QFT). This is relevant for our understanding of Quantum Chromodynamics (QCD). It includes the glueball spectrum, topological properties and the deconfining phase transition of pure SU(3) QFT. For instance, Relativistic Heavy Ion Collision (RHIC) experiments at the Brookhaven National Laboratory provide evidence that quarks confined in hadrons undergo, at sufficiently high temperature and pressure, a transition into a Quark-Gluon Plasma (QGP). Investigations of its thermodynamics in pure SU(3) QFT are of interest.
Solution method: Markov Chain Monte Carlo (MCMC) simulations of SU(3) Lattice Gauge Theory (LGT) with the Wilson action. This is a regularization of pure SU(3) QFT on a hypercubic lattice, which allows approaching the continuum SU(3) QFT by means of Finite Size Scaling (FSS) studies. Specifically, we provide updating routines for the Cabibbo-Marinari heatbath with and without checkerboard parallelization. While the former is suitable for pedagogical purposes and small-scale projects, the latter allows for efficient parallel processing. Targeting the geometry of RHIC experiments, we have implemented a Double-Layered Torus (DLT) lattice geometry, which has not previously been used in LGT MCMC simulations. It provides inside and outside layers at distinct temperatures, the lower-temperature layer acting as the outside boundary for the higher-temperature layer, in which the deconfinement transition takes place.
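To make the update scheme concrete, the following minimal Fortran sketch, written for this summary and not taken from the distributed package (the program name and all variable names are ours), multiplies an SU(3) link matrix by SU(2) matrices embedded in the (1,2), (1,3) and (2,3) subgroups, which is the core operation of the Cabibbo-Marinari scheme. In the package the SU(2) elements are drawn from the heatbath distribution determined by the staple sum; here they are drawn at random merely to keep the sketch self-contained.

      program cabmar_demo
c     Minimal sketch, not from the distributed package: an SU(3)
c     matrix u is multiplied by SU(2) matrices embedded in its
c     (1,2), (1,3) and (2,3) subgroups, as in the Cabibbo-Marinari
c     scheme.  The SU(2) elements are random here, not heatbath.
      implicit none
      complex*16 u(3,3),v(3,3),w(3,3),detu
      double precision a(0:3),anorm
      integer i,j,k,isub,l1(3),l2(3)
      data l1/1,1,2/, l2/2,3,3/
c     start from the unit link matrix
      do i=1,3
        do j=1,3
          u(i,j)=(0.d0,0.d0)
        end do
        u(i,i)=(1.d0,0.d0)
      end do
      do isub=1,3
c       random point on the SU(2) manifold: a0**2+...+a3**2=1
        do k=0,3
          call random_number(a(k))
          a(k)=2.d0*a(k)-1.d0
        end do
        anorm=sqrt(a(0)**2+a(1)**2+a(2)**2+a(3)**2)
        do k=0,3
          a(k)=a(k)/anorm
        end do
c       embed the SU(2) matrix a0+i*a.sigma into a 3x3 unit matrix
c       at rows/columns l1(isub) and l2(isub)
        do i=1,3
          do j=1,3
            v(i,j)=(0.d0,0.d0)
          end do
          v(i,i)=(1.d0,0.d0)
        end do
        v(l1(isub),l1(isub))=dcmplx( a(0), a(3))
        v(l1(isub),l2(isub))=dcmplx( a(2), a(1))
        v(l2(isub),l1(isub))=dcmplx(-a(2), a(1))
        v(l2(isub),l2(isub))=dcmplx( a(0),-a(3))
c       left-multiply the link: u <- v*u
        do i=1,3
          do j=1,3
            w(i,j)=(0.d0,0.d0)
            do k=1,3
              w(i,j)=w(i,j)+v(i,k)*u(k,j)
            end do
          end do
        end do
        do i=1,3
          do j=1,3
            u(i,j)=w(i,j)
          end do
        end do
      end do
c     the result should still be in SU(3): check the determinant
      detu=u(1,1)*(u(2,2)*u(3,3)-u(2,3)*u(3,2))
      detu=detu-u(1,2)*(u(2,1)*u(3,3)-u(2,3)*u(3,1))
      detu=detu+u(1,3)*(u(2,1)*u(3,2)-u(2,2)*u(3,1))
      write(*,*) 'det(u) = ',detu,' (should be (1.,0.))'
      end

The determinant printed at the end should equal one up to rounding, confirming that the product of SU(2) subgroup hits keeps the link in SU(3).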
Restrictions: The checkerboard partition of the lattice makes the development of measurement programs more tedious than is the case for an unpartitioned lattice. Presently, only one measurement routine, for Polyakov loops, is provided.
Unusual features: We provide three different versions of the send/receive function of the MPI library, which work for different operating system + compiler + MPI combinations; the appropriate version is chosen by activating the corresponding row among the last three rows of our latmpi.par parameter file. The underlying reason is distinct buffer conventions. A generic sketch of such alternatives is given after the Running time field below.
Running time: For a typical run using an Intel i7 processor, it takes (1.8–6)×10^-6 seconds to update one link of the lattice, depending on the compiler used. For example, for a simulation on a small 4×8^3 DLT lattice with a statistics of 2^21 sweeps (i.e., updating the two lattice layers of 4×(4×8^3) links each 2^21 times), the total CPU time needed can be

2 × 4 × (4×8^3) × 2^21 × 3×10^-6 seconds ≈ 1.7×10^3 minutes,

where
2 – two layers of the lattice,
4 – four dimensions (links per site),
4×8^3 – lattice size (number of sites),
2^21 – sweeps of updating,
3×10^-6 s – average time to update one link variable.
If we divide the job into 8 parallel processes, then the real time is (for negligible communication overhead) 1.7×10^3 min / 8 ≈ 0.2×10^3 min.
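As a generic illustration of the buffer-convention issue mentioned under Unusual features, and not a reproduction of the three versions shipped in the package, the following Fortran MPI sketch (program name, buffer size and ring topology are ours) shows two standard, deadlock-safe ways for neighbouring processes to exchange a boundary buffer: a combined MPI_SENDRECV, which is safe under any internal buffering policy, and an MPI_ISEND posted before a blocking MPI_RECV, which avoids relying on MPI_SEND buffering the outgoing message.

      program exchange_demo
c     Illustrative sketch, not taken from the package: two
c     deadlock-safe ways to exchange a boundary buffer between
c     neighbouring processes arranged on a ring.  Buffer size and
c     topology are hypothetical.
      implicit none
      include 'mpif.h'
      integer nbuf
      parameter (nbuf=1024)
      double precision sbuf(nbuf),rbuf(nbuf)
      integer myid,nproc,iright,ileft,i,ierr,ireq
      integer istat(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
      iright=mod(myid+1,nproc)
      ileft =mod(myid-1+nproc,nproc)
      do i=1,nbuf
        sbuf(i)=dble(myid)
      end do
c     variant 1: combined send/receive, safe for any buffering policy
      call MPI_SENDRECV(sbuf,nbuf,MPI_DOUBLE_PRECISION,iright,1,
     &                  rbuf,nbuf,MPI_DOUBLE_PRECISION,ileft,1,
     &                  MPI_COMM_WORLD,istat,ierr)
c     variant 2: non-blocking send paired with a blocking receive,
c     which does not rely on the implementation buffering MPI_SEND
      call MPI_ISEND(sbuf,nbuf,MPI_DOUBLE_PRECISION,iright,2,
     &               MPI_COMM_WORLD,ireq,ierr)
      call MPI_RECV(rbuf,nbuf,MPI_DOUBLE_PRECISION,ileft,2,
     &              MPI_COMM_WORLD,istat,ierr)
      call MPI_WAIT(ireq,istat,ierr)
      write(*,*) 'rank ',myid,' got data from ',ileft
      call MPI_FINALIZE(ierr)
      end

Both variants avoid the deadlock that can occur when every process issues a plain blocking MPI_SEND first and the implementation does not buffer the message internally; which convention an installation follows is exactly what the choice in latmpi.par accounts for.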