HPC Clusters Research Articles

Building a distributed deep learning (DDL) system on HPC clusters that guarantees convergence speed and scalability for DNN training is challenging. An HPC cluster consisting of multiple high-density multi-GPU servers connected by an InfiniBand network (HDGib) shortens the computing and communication time of distributed DNN training but brings new challenges: the convergence time of parallel DNN training is far from scaling linearly with the number of workers. We therefore analyze the optimization process and identify three key issues that degrade scalability. First, because computing and communication times are shortened, parameters are updated at high frequency, which exacerbates the stale gradient problem and slows convergence. Second, previous methods for constraining the gradient noise (stochastic error) of SGD are outdated, since the InfiniBand connections of HDGib clusters can support stricter constraints that bound the stochastic error further. Third, using the same learning rate for all workers is inefficient because the workers are at different training stages. We propose a momentum-driven adaptive synchronization model that addresses these issues and accelerates training on HDGib clusters. Our adaptive k-synchronization algorithm uses the momentum term to absorb stale gradients and adaptively bounds the stochastic error to provide an approximately optimal descent direction for distributed SGD. Our model also includes an individual dynamic learning rate search method for each worker to further improve training performance. Compared with previous linear and exponential decay methods, it provides a more precise descent distance for distributed SGD based on the different training stages. Extensive experimental results indicate that the proposed model effectively improves the training performance of CNNs, retaining high accuracy with a speed-up of up to 57.76% and 125.3% on the CPU-based and GPU-based clusters, respectively.
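To make the interaction between stale gradients and the momentum term concrete, the following is a minimal sketch of k-synchronous SGD with a heavy-ball momentum term on a one-dimensional toy objective. It is not the adaptive algorithm described in the abstract: k is fixed, the learning rate is constant, and the staleness pattern (the j-th reporting worker sees parameters j updates old) is an assumption chosen for illustration. The function names and default values are hypothetical.

```python
def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2 (minimum at w = 0).
    return w


def k_sync_momentum_sgd(w0, k=2, lr=0.1, beta=0.5, steps=200):
    """Toy k-synchronous SGD with heavy-ball momentum.

    Assumption for illustration: at every step exactly k workers report,
    and the j-th of them computed its gradient on a parameter copy that
    is j updates old (j = 0 is fresh, j = k - 1 is the most stale).
    """
    w = w0
    velocity = 0.0
    # snapshots[j] holds the parameter value from j updates ago.
    snapshots = [w0] * k
    for _ in range(steps):
        # Each reporting worker evaluates the gradient on its own stale copy.
        grads = [grad(stale_w) for stale_w in snapshots]
        avg_grad = sum(grads) / k
        # The momentum term accumulates the averaged (partly stale) gradient.
        velocity = beta * velocity + avg_grad
        w = w - lr * velocity
        # Shift the history: the new value becomes the freshest snapshot.
        snapshots = [w] + snapshots[:-1]
    return w


if __name__ == "__main__":
    print(k_sync_momentum_sgd(5.0))
```

With these defaults the iterate contracts towards the minimum at 0. Raising `beta` or the staleness makes the toy recursion oscillate more strongly, which is the kind of sensitivity that motivates adapting the synchronization and the learning rate.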

In this paper, we present a novel methodology for predicting the resources (memory and time) required by jobs submitted to HPC systems. Our methodology applies supervised machine learning to historical job data (sacct accounting data) provided by the Slurm workload manager. The resulting machine learning (ML) prediction model is useful for both HPC administrators and HPC users. Moreover, it increases the efficiency and utilization of HPC systems and thus reduces power consumption as well. Our model uses several supervised discriminative models from the scikit-learn machine learning library, together with LightGBM, applied to historical data from Slurm. It helps HPC users determine the amount of resources their submitted jobs require and makes it easier for them to use HPC resources efficiently. This work is the second step towards implementing our general open-source tool for HPC service providers. For this work, our model was implemented and tested using two HPC providers: an XSEDE service provider (University of Colorado Boulder, RMACC Summit) and Kansas State University (Beocat). We used more than two hundred thousand jobs, one hundred thousand from Summit and one hundred thousand from Beocat, to build and assess the model. In particular, we measured the improvement in running time, turnaround time, and average waiting time for the submitted jobs, as well as the utilization of the HPC clusters. Our model achieved up to 86% accuracy in predicting the amount of time and memory for both Summit and Beocat. Our results show that the model dramatically reduces average waiting time (from 380 to 4 hours on RMACC Summit and from 662 to 28 hours on Beocat), reduces turnaround time (from 403 to 6 hours on RMACC Summit and from 673 to 35 hours on Beocat), and achieves up to 100% utilization for both HPC resources.
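As a rough illustration of the data-preparation step behind such a predictor, the sketch below parses pipe-delimited accounting records (the layout produced by `sacct --parsable2`) and builds a naive per-user median runtime baseline. This baseline is a stand-in chosen to keep the example dependency-free; it is not the ensemble of scikit-learn and LightGBM models described in the abstract. The sample records, user names, and helper functions are hypothetical, and the duration parser assumes the `[DD-]HH:MM:SS` form.

```python
import csv
import io
from statistics import median

# Hypothetical records using real sacct field names (User, ReqCPUS,
# Timelimit, Elapsed); the values are invented for illustration.
SAMPLE = """User|ReqCPUS|Timelimit|Elapsed
alice|4|1-00:00:00|00:30:00
alice|4|1-00:00:00|00:40:00
alice|4|1-00:00:00|00:50:00
bob|16|02:00:00|01:45:00
"""


def duration_to_seconds(text):
    """Convert a [DD-]HH:MM:SS duration string to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def load_jobs(text):
    """Parse pipe-delimited records into dictionaries with numeric fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return [
        {
            "user": row["User"],
            "cpus": int(row["ReqCPUS"]),
            "requested_s": duration_to_seconds(row["Timelimit"]),
            "elapsed_s": duration_to_seconds(row["Elapsed"]),
        }
        for row in reader
    ]


def median_runtime_by_user(jobs):
    """Baseline predictor: the median past runtime of each user's jobs."""
    by_user = {}
    for job in jobs:
        by_user.setdefault(job["user"], []).append(job["elapsed_s"])
    return {user: median(values) for user, values in by_user.items()}


if __name__ == "__main__":
    jobs = load_jobs(SAMPLE)
    print(median_runtime_by_user(jobs))  # {'alice': 2400, 'bob': 6300}
```

In the sample, alice requests a full day but her jobs finish in under an hour, which is the gap between requested and used resources that a learned predictor is meant to close. A supervised model would replace the median with a regressor trained on features such as `cpus` and `requested_s`.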

Articles published on HPC Clusters

Analysis of workflow schedulers in simulated distributed environments

Benchmark study of 2D and 3D VOF simulations of a simplex nozzle using a hybrid RANS-LES approach

Toward highly parallel loading of unstructured meshes

A Head Loss Pressure Boundary Condition for Hydraulic Systems

Quantifying the health effects of exposure to non-exhaust road emissions using agent-based modelling (ABM)

Design considerations for workflow management systems use in production genomics research and the clinic

JUSUF: Modular Tier-2 Supercomputing and Cloud Infrastructure at Jülich Supercomputing Centre

Generating UAV high-resolution topographic data within a FOSS photogrammetric workflow using high-performance computing clusters

Momentum-driven adaptive synchronization model for distributed DNN training on HPC clusters

Workflow provenance in the lifecycle of scientific machine learning

Ensemble Prediction of Job Resources to Improve System Performance for Slurm-Based HPC Systems

All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns

Applying neural networks to predict HPC-I/O bandwidth over seismic data on lustre file system for ExSeisDat

Application of Container Technology in Numerical Ocean Model: a Kind of High-performance ROMS Containerized Architecture

Dataset for SC 21: Cross-Cluster User and Job Behavior on Production HPC Clusters

Construction and Performance Analysis of a Container-Based HPC Cluster for Big Data Analysis and Machine Learning Services

SkePU 3: Portable High-Level Programming of Heterogeneous Systems and HPC Clusters

Distributed CNN Inference on Resource-Constrained UAVs for Surveillance Systems: Design and Optimization

Distributed in-memory data management for workflow executions

Parallel Fast Multipole Method accelerated FFT on HPC clusters
