Thorough Characterization and Analysis of Large Transformer Model Training At-Scale

Scott Cheng, Venkatram Vishwanath, Siddhisanket Raskar, Murali Emani, Sam Foreman, Jun-Liang Lin, Zhen Xie, Mahmut Taylan Kandemir

doi:10.1145/3639034

Open Access

https://doi.org/10.1145/3639034

Abstract

Large transformer models have recently achieved great success across various domains. As parameter counts grow, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. The throughput of large-scale model training therefore depends heavily on network bandwidth, since combining model sharding with multiple parallelism strategies incurs various communication costs. However, prior characterizations of transformer models, performed on high-bandwidth DGX machines with TFLOPS as the metric, may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism exhibit significantly different training profiles across system bandwidths at scale and thus warrant a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time and quantitatively analyze their respective influences on end-to-end training scaling. Our evaluation includes an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies across six model sizes. We also evaluate three combinations of model parallelism on both high- and low-bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling and for shaping the future design of supercomputing systems.

Journal: Proceedings of the ACM on Measurement and Analysis of Computing Systems
Publication Date: Feb 16, 2024
Citations: 1
