Abstract
The Czech national HPC center IT4Innovations, located in Ostrava, provides two HPC systems, Anselm and Salomon. The Salomon HPC has been among the hundred most powerful supercomputers in the world since its commissioning in 2015. Both clusters were tested for use by the ATLAS experiment for running simulation jobs. Several thousand core hours were allocated to the project for tests, but the main aim is to use free resources waiting for the large parallel jobs of other users. Multiple strategies for ATLAS job execution were tested on the Salomon and Anselm HPCs. The solution described herein builds on ATLAS experience with other HPC sites. An ARC Compute Element (ARC-CE) installed at the grid site in Prague is used for job submission to Salomon. The ATLAS production system submits jobs to the ARC-CE via the ARC Control Tower (aCT). The ARC-CE processes job requirements from aCT and creates a script for the batch system, which is then executed via ssh. Sshfs is used to share scripts and input files between the site and the HPC cluster. The software used to run the jobs is rsynced from the site's CVMFS installation to the HPC's scratch space every day to ensure availability of recent software. With this setup, opportunistic capacity of the Salomon HPC was exploited.
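The file-sharing arrangement described in the abstract (an sshfs mount between the grid site and the cluster, plus a daily rsync of the ATLAS software from the site's CVMFS tree to the HPC scratch space) can be sketched as a short shell script. The host names, mount points, and paths below are illustrative assumptions, not the site's actual configuration:

```shell
#!/bin/bash
# Sketch of the site-to-HPC sharing and daily software sync described above.
# All hosts and paths are hypothetical placeholders.

set -euo pipefail

HPC_HOST="login.salomon.example.cz"      # assumed Salomon login node
REMOTE_SCRATCH="/scratch/atlas"          # assumed scratch area on the HPC
LOCAL_MOUNT="/mnt/salomon-scratch"       # mount point at the grid site
CVMFS_SW="/cvmfs/atlas.cern.ch/repo/sw"  # site's CVMFS software tree

# Share scripts and input files between the site and the cluster via sshfs,
# so the ARC-CE can place batch scripts where the HPC jobs will find them.
mkdir -p "${LOCAL_MOUNT}"
sshfs "${HPC_HOST}:${REMOTE_SCRATCH}" "${LOCAL_MOUNT}" -o reconnect

# Daily sync (e.g. from cron): mirror the CVMFS software installation to the
# HPC scratch space so worker nodes without CVMFS see recent releases.
rsync -a --delete "${CVMFS_SW}/" "${HPC_HOST}:${REMOTE_SCRATCH}/sw/"
```

In this kind of setup the rsync step is typically run from a daily cron job at the grid site, since the HPC worker nodes cannot mount CVMFS directly.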
Highlights
The Czech National Supercomputer Center IT4Innovations in Ostrava operates the Salomon HPC system, the most powerful computer in the Czech Republic; it is listed in the TOP500 and was ranked 87th in the world as of November 2017 [1]
Salomon was built in 2015 and provides 2 PFLOPS of peak performance. It consists of 1008 computational nodes, each with 24 Intel Xeon E5 cores and 128 GB of RAM, interconnected with InfiniBand (56 Gbps)
Jobs that wait for resources for a long time are killed in the batch system and reassigned by the ATLAS production system to another site to ensure timely completion of tasks
Summary
The Czech National Supercomputer Center IT4Innovations in Ostrava operates the Salomon HPC system, the most powerful computer in the Czech Republic; it is listed in the TOP500 and was ranked 87th in the world as of November 2017 [1]. Salomon was built in 2015 and provides 2 PFLOPS of peak performance. It consists of 1008 computational nodes, each with 24 Intel Xeon E5 cores and 128 GB of RAM, interconnected with InfiniBand (56 Gbps). 432 of the nodes also contain 61-core Intel Xeon Phi accelerators. The ATLAS experiment at CERN [2] uses the Salomon cluster in an opportunistic fashion via the Czech Tier-2 site (praguelcg2) [3]. The non-accelerated nodes are available for opportunistic usage and, unlike on other opportunistic HPC resources, there is no job pre-emption. Jobs that wait for resources for a long time are killed in the batch system and reassigned by the ATLAS production system to another site to ensure timely completion of tasks. The first successful ATLAS job submitted to Salomon via the ARC-CE finished in December 2017, and since then Salomon has continuously and significantly contributed to the Czech Tier-2 ATLAS production output