Abstract
In large language models (LLMs), full-parameter fine-tuning is crucial for task-specific adaptation. Traditionally, this relies on deep learning training frameworks built around the back-propagation scheme. However, this scheme presents inherent issues, e.g., activation memory bottlenecks and backward locking, which limit efficient use of computational resources. In this work, we present the design and analysis of ZeROf-Offload, a novel fine-tuning framework that adapts the forward-gradient scheme. The framework adopts a forward-gradient-oriented CPU offload strategy, enabling fine-tuning of billion-scale LLMs solely in the forward phase and improving computational efficiency. Empirical evaluations reveal the advantage of eliminating the backward phase in fine-tuning. ZeROf-Offload achieves 134 TFlops/GPU for models with over 130 billion parameters on a single DGX-A100 node, outperforming DeepSpeed's ZeRO-Offload, which achieves 102 TFlops/GPU for models with up to 53.7 billion parameters, the largest size manageable within GPU memory limitations. Furthermore, we have extended ZeROf-Offload to multi-DGX-A100 environments with integrated 3D parallelism, achieving near-linear speedup across up to 128 GPUs and improving token throughput by 1.4x and 1.5x, respectively. The experimental results demonstrate that ZeROf-Offload achieves the highest throughput among all examined state-of-the-art frameworks.
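The forward-gradient scheme mentioned above replaces back-propagation with forward-mode directional derivatives: a random tangent vector v is sampled, a single forward pass yields the directional derivative ∇f(w)·v (a Jacobian-vector product), and (∇f(w)·v)·v serves as an unbiased estimate of the gradient. A minimal sketch, using a toy quadratic objective with a hand-written JVP purely for illustration (a real framework would obtain the JVP from forward-mode automatic differentiation; all names here are assumptions, not the paper's API):

```python
import numpy as np

def forward_gradient_step(w, jvp_fn, lr=0.01, rng=None):
    """One forward-gradient update.

    Samples a random tangent v, obtains the directional derivative
    grad(f)(w) . v from a single forward pass (no backward phase),
    and uses (grad(f)(w) . v) * v as an unbiased gradient estimate.
    """
    rng = rng if rng is not None else np.random.default_rng()
    v = rng.standard_normal(w.shape)   # random tangent direction
    dfv = jvp_fn(w, v)                 # scalar directional derivative
    return w - lr * dfv * v            # descend along the estimate

# Toy objective f(w) = ||w||^2, whose JVP is 2 w . v (hypothetical
# stand-in for a model's forward-mode pass).
jvp = lambda w, v: 2.0 * w @ v

rng = np.random.default_rng(0)
w = rng.standard_normal(8)
for _ in range(2000):
    w = forward_gradient_step(w, jvp, lr=0.01, rng=rng)
# w drifts toward the minimizer at the origin
```

Because no backward pass is needed, activations do not have to be retained for a later gradient computation, which is what makes the abstract's forward-phase-only CPU offload strategy possible.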