DeepSpeed

Abstract

Explore new techniques in DeepSpeed, Microsoft's open-source library that advances large-model training by improving scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models. DeepSpeed is compatible with PyTorch. One piece of the library, ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), which at the time of its release was the largest publicly known language model, at 17 billion parameters. In addition, we will go over our latest transformer kernel advancements, which led the DeepSpeed team to achieve the world's fastest BERT pretraining record. The Zero Redundancy Optimizer (ZeRO) is a novel memory optimization technology for large-scale distributed deep learning. ZeRO can train deep learning models with over 100 billion parameters on the current generation of GPU clusters at three to five times the throughput of the current best system. It also presents a clear path to training models with trillions of parameters, demonstrating an unprecedented leap in deep learning system technology. DeepSpeed brings state-of-the-art training techniques, such as ZeRO, optimized kernels, distributed training, mixed precision, and checkpointing, through lightweight APIs compatible with PyTorch. With just a few lines of code changes to your PyTorch model, you can leverage DeepSpeed to address underlying performance challenges and boost the speed and scale of your training.
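
The abstract does not spell out how ZeRO saves memory. As a rough illustration only, the sketch below uses the per-GPU model-state accounting described in the ZeRO paper, which is an assumption brought in from outside this abstract: with mixed-precision Adam, each parameter costs 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and about K = 12 bytes of fp32 optimizer state, and each successive ZeRO stage partitions one more of these across the data-parallel GPUs. The function name `zero_model_state_bytes` is hypothetical and not part of DeepSpeed, and activations, buffers, and fragmentation are ignored.

```python
def zero_model_state_bytes(num_params, num_gpus, stage, k=12):
    """Approximate per-GPU bytes of model state under mixed-precision Adam.

    stage 0: plain data parallelism, everything replicated on every GPU
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    k is the optimizer-state cost in bytes per parameter (assumed 12 for Adam).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = k * num_params    # fp32 weights copy, momentum, variance

    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Illustrative sizes only: 7.5 billion parameters on 64 GPUs.
    for stage in range(4):
        gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 7.5-billion-parameter model on 64 GPUs needs roughly 120.0, 31.4, 16.6, and 1.9 GB of model state per GPU for stages 0 to 3, which gives a sense of why partitioning makes much larger models fit on the same hardware.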

Similar Papers
  • Research Article
  • 10.1016/j.future.2024.107597
Convergence-aware optimal checkpointing for exploratory deep learning training jobs
  • Nov 8, 2024
  • Future Generation Computer Systems
  • Hongliang Li + 5 more

  • Research Article
  • Citations: 11
  • 10.3906/elk-2105-36
Deep hyperparameter transfer learning for diabetic retinopathy classification
  • Oct 4, 2021
  • Turkish Journal of Electrical Engineering & Computer Sciences
  • Mahesh S Patil + 6 more

The detection of diabetic retinopathy (DR) in millions of diabetic patients across the globe is a challenging problem. Diagnosis of retinopathy is a lengthy and tedious process, requiring a medical professional to assess the individual fundus images of a patient's retina. This process can be automated by applying deep learning (DL) technology, given a huge dataset. The problems associated with DL are the unavailability of large datasets and long training times. A DL model's best performance is achieved using a set of optimal hyperparameters (OHPs) obtained by performing costly iterations of hyperparameter optimization (HPO). These problems can be addressed by using transfer learning (TL) in both DL model training and HPO. TL in HP tuning is the focus of this work. The authors study the applicability of the EyePACS DR dataset's OHPs to other DR datasets, which forms the research question addressed in this work. DR classification is performed using a ResNet model trained on the EyePACS (Kaggle) and Indian Diabetic Retinopathy Image Dataset (IDRiD) datasets. The HPs tuned in this work are data augmentation configuration, number of layers, optimizers, data samplers, learning rate, and momentum. The authors demonstrate that the EyePACS dataset's OHPs are suitable for training with the IDRiD dataset without needing to tune HPs for IDRiD from scratch. The OHPs for a task and their reusability are poorly reported in the literature. Therefore, the EyePACS DR dataset's OHPs reported here can be used by other researchers. Moreover, researchers working on other DR datasets can also apply the same OHPs, since they are reusable and no iterations of HPO are required. The OHPs are provided for both the EyePACS and IDRiD datasets after being tuned from scratch, and can be used by others as a starting point for HPO.

  • Conference Article
  • Citations: 9
  • 10.1109/ipdpsw52791.2021.00144
Distributed Deep Learning Using Volunteer Computing-Like Paradigm
  • Jun 1, 2021
  • Medha Atre + 2 more

Use of Deep Learning (DL) in commercial applications such as image classification, sentiment analysis and speech recognition is increasing. When training DL models with a large number of parameters and/or large datasets, the cost and speed of training can become prohibitive. Distributed DL training solutions that split a training job into subtasks and execute them over multiple nodes can decrease training time. However, the cost of current solutions, built predominantly for cluster computing systems, can still be an issue. In contrast to cluster computing systems, Volunteer Computing (VC) systems can lower the cost of computing, but applications running on VC systems have to handle fault tolerance, variable network latency and heterogeneity of compute nodes, and the current solutions are not designed to do so. We design a distributed solution that can run DL training on a VC system by using a data parallel approach. We implement a novel asynchronous SGD scheme called VC-ASGD suited for VC systems. In contrast to traditional VC systems that lower cost by using untrustworthy volunteer devices, we lower cost by leveraging preemptible computing instances on commercial cloud platforms. By using preemptible instances that require applications to be fault tolerant, we lower cost by 70-90% and improve data security.

  • Research Article
  • Citations: 27
  • 10.1007/s10664-020-09894-9
Automated end-to-end management of the modeling lifecycle in deep learning
  • Feb 19, 2021
  • Empirical Software Engineering
  • Gharib Gharibi + 4 more

Deep learning has improved the state-of-the-art results in an ever-growing number of domains. This success heavily relies on the development and training of deep learning models–an experimental, iterative process that produces tens to hundreds of models before arriving at a satisfactory result. While there has been a surge in the number of tools and frameworks that aim at facilitating deep learning, the process of managing the models and their artifacts is still surprisingly challenging and time-consuming. Existing model-management solutions are either tailored for commercial platforms or require significant code changes. Moreover, most of the existing solutions address a single phase of the modeling lifecycle, such as experiment monitoring, while ignoring other essential tasks, such as model deployment. In this paper, we present a software system to facilitate and accelerate the deep learning lifecycle, named ModelKB. ModelKB can automatically manage the modeling lifecycle end-to-end, including (1) monitoring and tracking experiments; (2) visualizing, searching for, and comparing models and experiments; (3) deploying models locally and on the cloud; and (4) sharing and publishing trained models. Moreover, our system provides a stepping-stone for enhanced reproducibility. ModelKB currently supports TensorFlow 2.0, Keras, and PyTorch, and it can be extended to other deep learning frameworks easily.

  • Research Article
  • Citations: 17
  • 10.1016/j.modpat.2023.100331
Automating Ground Truth Annotations for Gland Segmentation Through Immunohistochemistry
  • Sep 15, 2023
  • Modern Pathology: an official journal of the United States and Canadian Academy of Pathology, Inc
  • Tushar Kataria + 6 more

Microscopic evaluation of glands in the colon is of utmost importance in the diagnosis of inflammatory bowel disease and cancer. When properly trained, deep learning pipelines can provide a systematic, reproducible, and quantitative assessment of disease-related changes in glandular tissue architecture. The training and testing of deep learning models require large amounts of manual annotations, which are difficult, time-consuming, and expensive to obtain. Here, we propose a method for automated generation of ground truth in digital hematoxylin and eosin (H&E)–stained slides using immunohistochemistry (IHC) labels. The image processing pipeline generates annotations of glands in H&E histopathology images from colon biopsy specimens by transfer of gland masks from KRT8/18, CDX2, or EPCAM IHC. The IHC gland outlines are transferred to coregistered H&E images for training of deep learning models. We compared the performance of the deep learning models to that of manual annotations using an internal held-out set of biopsy specimens as well as 2 public data sets. Our results show that EPCAM IHC provides gland outlines that closely match manual gland annotations (Dice = 0.89) and are resilient to damage by inflammation. In addition, we propose a simple data sampling technique that allows models trained on data from several sources to be adapted to a new data source using just a few newly annotated samples. The best performing models achieved average Dice scores of 0.902 and 0.89 on Gland Segmentation and Colorectal Adenocarcinoma Gland colon cancer public data sets, respectively, when trained with only 10% of annotated cases from either public cohort. Altogether, the performances of our models indicate that automated annotations using cell type–specific IHC markers can safely replace manual annotations. 
Automated IHC labels from single-institution cohorts can be combined with small numbers of hand-annotated cases from multi-institutional cohorts to train models that generalize well to diverse data sources.
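
The Dice scores quoted above measure overlap between a predicted gland mask and a reference mask. As a minimal sketch, assuming the standard definition Dice = 2|A ∩ B| / (|A| + |B|) on flat binary masks (the abstract does not say how the authors computed or averaged the score), it can be written as follows. The function name `dice_score` and the convention for two empty masks are illustrative choices.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length.

    Each mask is a sequence of 0/1 values, one per pixel.
    Two empty masks are treated as a perfect match (an illustrative convention).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * overlap / total
```

A score of 1.0 means the two outlines agree on every foreground pixel, and 0.0 means they share none.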

  • Research Article
  • Citations: 6
  • 10.1109/lca.2023.3261136
XLA-NDP: Efficient Scheduling and Code Generation for Deep Learning Model Training on Near-Data Processing Memory
  • Jan 1, 2023
  • IEEE Computer Architecture Letters
  • Jueon Park + 1 more

Deep learning (DL) model training must address the memory bottleneck to continue scaling. Processing-in-memory approaches can be a viable solution, as they move computations near or into the memory, substantially reducing data movement. However, to deploy applications on such hardware, end-to-end software support is crucial for efficient computation mapping and scheduling as well as extensible code generation, and DL training workloads have not yet been considered. In this paper, we propose XLA-NDP, a compiler and runtime solution for NDPX, a near-data processing (NDP) architecture integrated with an existing DL training framework. XLA-NDP offloads NDPX kernels and schedules them to overlap with GPU kernels to maximize parallelism based on GPU and NDPX costs, while providing a template-based code generator with low-level optimizations. The experiments showed that XLA-NDP provides up to 1.41x speedup (1.24x on average) over the GPU baseline when training four DL models.

  • Research Article
  • Citations: 4
  • 10.3390/jcm9020364
Validation of Deep Learning-Based Artifact Correction on Synthetic FLAIR Images in a Different Scanning Environment
  • Jan 29, 2020
  • Journal of Clinical Medicine
  • Kyeong Hwa Ryu + 8 more

We investigated the capability of a trained deep learning (DL) model with a convolutional neural network (CNN) to improve the quality of synthetic fluid-attenuated inversion recovery (FLAIR) images in a different scanning environment. Data from 319 patients, obtained by retrospective review, were used as test sets for the already trained DL model to correct the synthetic FLAIR images. Quantitative analyses were performed for native synthetic FLAIR and DL-FLAIR images against conventional FLAIR images. Two neuroradiologists assessed the quality and artifact degree of the native synthetic FLAIR and DL-FLAIR images. The quantitative parameters were significantly better on DL-FLAIR than on the native synthetic FLAIR in all individual tissue segments and total intracranial tissues (p < 0.0001). DL-FLAIR images showed improved image quality with fewer artifacts than the native synthetic FLAIR images (p < 0.0001). There was no significant difference in the preservation of the periventricular white matter hyperintensities and lesion conspicuity between the two FLAIR image sets (p = 0.217). The quality of synthetic FLAIR images was improved through artifact correction using the trained DL model in a different scan environment. DL-based correction can be a promising solution for improving the quality of synthetic FLAIR images to broaden the clinical use of synthetic magnetic resonance imaging (MRI).

  • Conference Article
  • 10.1109/icc45855.2022.9839126
PipeCompress: Accelerating Pipelined Communication for Distributed Deep Learning
  • May 16, 2022
  • Juncai Liu + 3 more

Distributed learning is widely used to accelerate the training of deep learning models, but it is known that communication efficiency limits the scalability of distributed learning systems. Current gradient compression techniques provide promising methods to reduce communication time, but the extra time incurred by compression is not negligible. After compression techniques are applied, the communication time is significantly reduced because the data size needed to communicate becomes much smaller, but compressing gradients is time-consuming and it becomes a new bottleneck. In this paper, we design and implement PipeCompress, a system to decouple compression and backpropagation operations into two processes and pipeline the two processes to hide compression time. We also propose a specialized inter-process communication mechanism based on the characteristics of DNN distributed training to improve the efficiency of passing messages between the two processes, which makes sure that the decoupling does not bring much extra inter-process communication time cost. As far as we know, this is the first work that notices the overhead of compression and pipelines backpropagation and compression operations to hide compression time in distributed learning. Experiments show that PipeCompress can significantly hide compression time, reduce iteration time, and accelerate the training process on various DNN models.

  • Conference Article
  • Citations: 2
  • 10.1109/drbsd754563.2021.00006
Mitigating Catastrophic Forgetting in Deep Learning in a Streaming Setting Using Historical Summary
  • Nov 1, 2021
  • Sajal Dash + 4 more

Recent advancements in scientific equipment and the adoption of electronics and the Internet of Things (IoT) in our everyday lives have resulted in large and complex data being produced at a high rate. Making meaningful and timely knowledge discoveries from this big data at a modest cost is difficult because of computing power and storage limitations. Training deep learning models incrementally in a streaming setting can help overcome these limitations. However, in a well-known phenomenon named catastrophic forgetting, incrementally trained models perform increasingly poorly on past data. To mitigate catastrophic forgetting when training in a streaming setting, we propose constructing a historical summary over time and using the summary together with newly arrived data during incremental training. We propose various data summarization techniques, such as random sampling, micro clustering, coreset computation, and autoencoders, to counteract catastrophic forgetting. We built a pipeline for incremental training with a historical summary for training deep learning models on streaming data. We demonstrate the effectiveness of the historical summary in mitigating catastrophic forgetting using three case studies involving three different deep learning applications: an Artificial Neural Network (ANN) for a classification task on the MNIST dataset, a language model (RNN-LM) on the WikiText2 dataset, and a Convolutional Neural Network (CNN), ResNet50, to classify the ImageNet dataset. Through the training of the models, we observe that catastrophic forgetting is evident in the ANN and CNN but not in the RNN. For the first task, our method recovers up to 47.9% of the accuracy lost to catastrophic forgetting. For the third task, the historical summary recovers classification accuracy by up to 25%. For the second task, though there is no evidence of catastrophic forgetting, the training performance (PPL) improves by up to 26% with the historical summary.
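
To make the historical-summary idea concrete, here is a minimal sketch of the random-sampling variant. It assumes uniform reservoir sampling as the summarizer, which is our stand-in and not necessarily the procedure the authors used. The names `ReservoirSummary` and `training_set` are hypothetical.

```python
import random


class ReservoirSummary:
    """Fixed-size uniform random sample of everything seen so far in a stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self._rng = random.Random(seed)

    def update(self, batch):
        """Fold a newly arrived batch into the summary (reservoir sampling)."""
        for item in batch:
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(item)
            else:
                # Keep the new item with probability capacity / seen.
                j = self._rng.randrange(self.seen)
                if j < self.capacity:
                    self.items[j] = item


def training_set(summary, new_batch):
    """Data for one incremental step: summary of past data plus the new batch."""
    return list(summary.items) + list(new_batch)
```

In an incremental loop, each step would train on `training_set(summary, batch)` and then call `summary.update(batch)`, so the model keeps seeing a bounded sample of older data alongside the newest arrivals.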

  • Research Article
  • Citations: 17
  • 10.1016/j.jnca.2020.102590
JPAS: Job-progress-aware flow scheduling for deep learning clusters
  • Mar 11, 2020
  • Journal of Network and Computer Applications
  • Pan Zhou + 4 more

  • Research Article
  • Citations: 28
  • 10.1109/jbhi.2022.3190923
A Fully Deep Learning Paradigm for Pneumoconiosis Staging on Chest Radiographs
  • Oct 1, 2022
  • IEEE Journal of Biomedical and Health Informatics
  • Wenjian Sun + 10 more

Pneumoconiosis staging has been a very challenging task, both for certified radiologists and computer-aided detection algorithms. Although deep learning has shown proven advantages in the detection of pneumoconiosis, it remains challenging in pneumoconiosis staging due to the stage ambiguity of pneumoconiosis and noisy samples caused by misdiagnosis when they are used in training deep learning models. In this article, we propose a fully deep learning pneumoconiosis staging paradigm that comprises a segmentation procedure and a staging procedure. The segmentation procedure extracts lung fields in chest radiographs through an Asymmetric Encoder-Decoder Network (AED-Net) that can mitigate the domain shift between multiple datasets. The staging procedure classifies the lung fields into four stages through our proposed deep log-normal label distribution learning and focal staging loss. The two cascaded procedures can effectively solve the problem of model overfitting caused by stage ambiguity and noisy labels of pneumoconiosis. Besides, we collect a clinical chest radiograph dataset of pneumoconiosis from the certified radiologist's diagnostic reports. The experimental results on this novel pneumoconiosis dataset confirm that the proposed deep pneumoconiosis staging paradigm achieves an Accuracy of 90.4%, a Precision of 84.8%, a Sensitivity of 78.4%, a Specificity of 95.6%, an F1-score of 80.9% and an Area Under the Curve (AUC) of 96%. In particular, we achieve 68.4% Precision, 76.5% Sensitivity, 95% Specificity, 72.2% F1-score and 89% AUC on the early pneumoconiosis 'stage-1'.

  • Research Article
  • Citations: 7
  • 10.1007/s12083-017-0574-4
Training deep neural network on multiple GPUs with a model averaging method
  • Jun 15, 2017
  • Peer-to-Peer Networking and Applications
  • Qiongjie Yao + 2 more

Deep learning has shown considerable promise in numerous practical machine learning applications. However, training deep learning models is highly time-consuming. To solve this problem, many studies design distributed deep learning systems with multiple graphics processing units (GPUs) on a single machine or across machines. Data parallelism is the usual method for using multiple GPUs. However, this method is not suitable for all deep learning models, such as fully connected deep neural networks (DNNs), because of the transfer overhead. In this paper, we analyze the transfer overhead. Parameter synchronization is the key factor causing the transfer overhead. To reduce parameter synchronization, we propose a multi-GPU framework based on model averaging, where each GPU trains a whole model until convergence and the CPU averages the models to produce the final model. The only parameter synchronization occurs when all GPUs have finished training, thus dramatically reducing transfer overhead. Experimental results show that the model averaging method achieves speedups of 1.6x with two GPUs and 1.8x with four GPUs compared with training on a single GPU. Compared with the data parallelism method, it also achieves speedups of 17x and 25x on two GPUs and four GPUs, respectively.
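
The averaging step described above can be sketched in a few lines. This is an illustration under simplifying assumptions: each trained model is represented as a flat list of floats, and a plain unweighted mean is used, since the abstract does not specify the parameter layout or any weighting. The name `average_models` is hypothetical.

```python
def average_models(models):
    """Element-wise mean of parameter vectors, one vector per GPU worker.

    Intended to run once, on the CPU, after every worker has finished training.
    """
    if not models:
        raise ValueError("need at least one model")
    size = len(models[0])
    if any(len(m) != size for m in models):
        raise ValueError("all models must have the same number of parameters")
    count = len(models)
    return [sum(values) / count for values in zip(*models)]
```

Because this is the only point where parameters cross device boundaries, the communication cost is a single transfer per worker rather than one per training step.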

  • Research Article
  • Citations: 32
  • 10.1109/tpds.2018.2833074
MPCA SGD—A Method for Distributed Training of Deep Learning Models on Spark
  • Nov 1, 2018
  • IEEE Transactions on Parallel and Distributed Systems
  • Matthias Langer + 3 more

Many distributed deep learning systems have been published over the past few years, often accompanied by impressive performance claims. In practice these figures are often achieved in high performance computing (HPC) environments with fast InfiniBand network connections. For average deep learning practitioners this is usually an unrealistic scenario, since they cannot afford access to these facilities. Simple re-implementations of algorithms such as EASGD [1] for standard Ethernet environments often fail to replicate the scalability and performance of the original works [2]. In this paper, we explore this particular problem domain and present MPCA SGD, a method for distributed training of deep neural networks that is specifically designed to run in low-budget environments. MPCA SGD tries to make the best possible use of available resources, and can operate well if network bandwidth is constrained. Furthermore, MPCA SGD runs on top of the popular Apache Spark [3] framework. Thus, it can easily be deployed in existing data centers and office environments where Spark is already used. When training large deep learning models in a gigabit Ethernet cluster, MPCA SGD achieves significantly faster convergence rates than many popular alternatives. For example, MPCA SGD can train ResNet-152 [4] up to 5.3x faster than state-of-the-art systems like MXNet [5], up to 5.3x faster than bulk-synchronous systems like SparkNet [6] and up to 5.3x faster than decentral asynchronous systems like EASGD [1].

  • Conference Article
  • Citations: 3
  • 10.1109/icce56470.2023.10043527
Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters
  • Jan 6, 2023
  • Hyungjun Kim + 3 more

Distributed deep learning is now unavoidable when training large-scale deep learning models. Beyond a certain scale, training can take days or months, which can have catastrophic consequences in applications that must react quickly to trends or make rapid decisions. Distributed deep learning methods are broadly divided into synchronous and asynchronous, according to how parameter updates are synchronized. The former updates the parameters with the average of the gradients computed by all workers, so processing speed is limited by the slowest worker. The latter is faster because it updates parameters without waiting for the slowest worker, but it can converge more slowly because of the stale gradient problem. In this paper, we propose Dynamic Partial All-Reduce, a distributed learning algorithm that uses a synchronous method but dynamically manages whether workers participate in global synchronization, autonomously controlling the effects of the straggler problem. When a slow worker is detected, the algorithm limits the straggler's influence by excluding that worker from global communication and letting the remaining workers update the parameters. When the slow worker recovers its normal speed, it rejoins the synchronization group. The decision is made by comparing which causes the greater loss in speed and convergence: the computational power lost by excluding one GPU from training, the omission of the training data assigned to that worker, or the slowdown caused by the straggler. We implemented this algorithm on PyTorch and Horovod, and all experiments were conducted on Tencent Cloud.
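
The exclude-and-rejoin behaviour can be illustrated with a small sketch. It replaces the paper's cost comparison with a much simpler rule, chosen here only for illustration: a worker is treated as a straggler when its last step took more than `slow_factor` times the median step time. The function names, the threshold rule, and the dictionary-based gradient layout are all hypothetical and do not reflect the PyTorch/Horovod implementation.

```python
from statistics import median


def select_participants(step_times, slow_factor=1.5):
    """Return the workers allowed into this round of synchronization.

    step_times maps worker name -> duration of its most recent step (seconds).
    Re-evaluating this every round lets a recovered worker rejoin automatically.
    """
    cutoff = slow_factor * median(step_times.values())
    return sorted(w for w, t in step_times.items() if t <= cutoff)


def partial_all_reduce(grads, participants):
    """Average gradient vectors over the participating workers only."""
    if not participants:
        raise ValueError("at least one worker must participate")
    count = len(participants)
    size = len(grads[participants[0]])
    return [sum(grads[w][i] for w in participants) / count for i in range(size)]
```

For example, with step times of 1.0, 1.1, 3.0, and 0.9 seconds on four workers, the median is 1.05 and the cutoff is 1.575, so the 3.0-second worker sits out that round and the averaged gradient comes from the other three.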

  • Research Article
  • Citations: 21
  • 10.1016/j.sysarc.2023.102927
Communication compression techniques in distributed deep learning: A survey
  • Jun 22, 2023
  • Journal of Systems Architecture
  • Zeqin Wang + 5 more
