Abstract

This paper contains two parts revolving around Monte Carlo transport simulation on Intel Many Integrated Core coprocessors (MIC, also known as Xeon Phi). (1) MCNP 6.1 was recompiled into multithreading (OpenMP) and multiprocessing (MPI) forms, respectively, without modification to the source code. The new codes were tested on a 60-core 5110P MIC. The test case was FS7ONNi, a radiation shielding problem used in MCNP's verification and validation suite. It was observed that both codes ran slower on the MIC than on a 6-core X5650 CPU, by a factor of ~4 for the MPI code and, abnormally, ~20 for the OpenMP code, and both exhibited limited strong-scaling capability. (2) We have recently added a Constructive Solid Geometry (CSG) module to our ARCHER code to provide better support for geometry modelling in radiation shielding simulations. The functions of this module are called frequently during the particle random walk. To identify the performance bottleneck, we developed a CSG proxy application and profiled the code using the geometry data from FS7ONNi. The profiling data showed that the code was primarily memory latency bound on the MIC. This study suggests that, despite the low initial porting effort, Monte Carlo codes do not naturally lend themselves to the MIC platform, just as with GPUs, and that the memory latency problem needs to be addressed in order to achieve a decent performance gain.
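As a concrete illustration of the kind of geometry query a CSG proxy application exercises during the particle random walk, the following is a minimal C++ sketch of locating a particle in a cell by evaluating surface senses. It is not the ARCHER or MCNP implementation; the types, names, and plane-only surface model are simplifying assumptions made for illustration.

    // Minimal, illustrative CSG cell lookup (not the ARCHER/MCNP code).
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Surface {              // simplified surface: a plane a*x + b*y + c*z + d = 0
        double a, b, c, d;
        double evaluate(double x, double y, double z) const {
            return a * x + b * y + c * z + d;
        }
    };

    struct Cell {                                          // intersection of half-spaces
        std::vector<std::pair<std::size_t, int>> bounds;   // (surface index, required sign)
    };

    // Return the index of the cell containing point (x, y, z), or -1 if none.
    long locate(const std::vector<Surface>& surfaces,
                const std::vector<Cell>& cells,
                double x, double y, double z) {
        for (std::size_t c = 0; c < cells.size(); ++c) {
            bool inside = true;
            for (const auto& b : cells[c].bounds) {
                double s = surfaces[b.first].evaluate(x, y, z);
                if ((s >= 0.0 ? +1 : -1) != b.second) { inside = false; break; }
            }
            if (inside) return static_cast<long>(c);
        }
        return -1;
    }

    int main() {
        std::vector<Surface> surfaces = {{1, 0, 0, -1}, {-1, 0, 0, 3}};  // planes x = 1 and x = 3
        std::vector<Cell> cells(1);
        cells[0].bounds = {{0, +1}, {1, +1}};                            // slab between the two planes
        return locate(surfaces, cells, 2.0, 0.0, 0.0) == 0 ? 0 : 1;      // point (2, 0, 0) lies inside
    }

Each lookup of this kind chases through per-cell bound lists and surface records, so the access pattern is irregular, which is consistent with the memory-latency-bound behaviour reported in the profiling above.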

Highlights

  • This study suggests that, despite the low initial porting effort, Monte Carlo codes do not naturally lend themselves to the Many Integrated Core coprocessor (MIC) platform, just as with Graphics Processing Units (GPUs), and that the memory latency problem needs to be addressed in order to achieve a decent performance gain.

  • In recent years, hardware acceleration using Many Integrated Core coprocessors (MICs) made by Intel or Graphics Processing Units (GPUs) made by Nvidia has become increasingly common in scientific computing.

  • Examples include the development of the CUDA runtime Application Programming Interface (API), built on the original low-level driver API, which significantly reduces the amount of boilerplate code and improves readability, and "unified memory", which eases the burden of memory management to some extent by eliminating the need for explicit data copies (see the sketch after this list).
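As a hedged illustration of the unified-memory idea in the last highlight, the sketch below uses a single managed allocation that is visible to both host and device, so no explicit copy calls are needed around the kernel launch. The kernel and variable names are hypothetical.

    // Minimal sketch: CUDA unified memory instead of explicit copies.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float *x = nullptr;

        // One managed allocation, accessible from both host and device.
        cudaMallocManaged(&x, n * sizeof(float));
        for (int i = 0; i < n; ++i) x[i] = 1.0f;

        scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
        cudaDeviceSynchronize();   // wait for the kernel before the host reads x

        printf("x[0] = %f\n", x[0]);
        cudaFree(x);
        return 0;
    }

With the classic driver or runtime API, the same program would need separate host and device allocations plus explicit host-to-device and device-to-host copies around the kernel launch.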


Introduction

In recent years, hardware acceleration using Many Integrated Core coprocessors (MICs) made by Intel or Graphics Processing Units (GPUs) made by Nvidia has become increasingly common in scientific computing. Two specific questions from developers are: (1) how hard is it to port existing codes to accelerators, how good is the performance, and what is the bottleneck? (2) how hard is it to perform accelerator-specific optimization? The MICs (Knights Corner generation) and GPUs (Kepler and Maxwell generations) are not binary compatible with CPUs, which means existing programs cannot run directly on these accelerators. For GPUs, the codes need to be rewritten in Nvidia's GPU-specific Application Programming Interface (API) called CUDA [3]. Alternative approaches do exist, such as the compiler-directive-based OpenACC [4] and newer versions of OpenMP (≥ 4.0) [5], which facilitate code porting at the cost of less functionality and lower performance than CUDA.
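For readers unfamiliar with the directive-based alternatives mentioned above, the following is a minimal sketch of offloading a loop with an OpenMP 4.x target construct; the array name and loop body are hypothetical and only illustrate the mechanism, not any of the codes studied here.

    // Minimal sketch: directive-based offload with OpenMP 4.x.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<double> flux(n, 0.0);
        double *f = flux.data();

        // Map the array to the device, run the loop there, copy results back.
        #pragma omp target teams distribute parallel for map(tofrom: f[0:n])
        for (int i = 0; i < n; ++i)
            f[i] += 1.0;

        printf("flux[0] = %f\n", flux[0]);
        return 0;
    }

The appeal of this style is that the same annotated source can still be compiled for the CPU by ignoring the directives, which is part of what keeps the initial porting effort low.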

