https://youtu.be/9MvD-XsowsE?si=GePSfotRWhTYfPIE

Large Scale Distributed Training

Large-scale distributed training is the process of training neural networks on tens, hundreds, thousands, or even tens of thousands of devices concurrently. It has become the new norm in deep learning.

Five degrees of parallelism are exploited in large-scale distributed training (see the composition sketch after the list):

  1. Data parallelism (DP, FSDP, HSDP)
  2. Context parallelism
  3. Pipeline parallelism
  4. Tensor parallelism
  5. Activation checkpointing (for memory saving)
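These degrees are typically composed by mapping each one to an axis of a multi-dimensional device mesh. Below is a minimal sketch, assuming PyTorch 2.3+ and a torchrun launch on a single 8-GPU node; the axis names and sizes (dp=2, cp=1, pp=1, tp=4) are illustrative assumptions, not values from the video.

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh


def build_mesh(dp: int, cp: int, pp: int, tp: int):
    # The world size (number of processes launched by torchrun) must equal
    # dp * cp * pp * tp, e.g. 8 = 2 * 1 * 1 * 4 on a single 8-GPU node.
    return init_device_mesh(
        "cuda",
        mesh_shape=(dp, cp, pp, tp),
        mesh_dim_names=("dp", "cp", "pp", "tp"),
    )


if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=8 mesh_sketch.py
    dist.init_process_group(backend="nccl")
    mesh = build_mesh(dp=2, cp=1, pp=1, tp=4)
    # Each named sub-mesh is the communication group for one degree of parallelism.
    dp_mesh = mesh["dp"]  # data parallelism: DP / FSDP / HSDP sharding and gradient reduction
    tp_mesh = mesh["tp"]  # tensor parallelism: sharding weights across GPUs within a node
    print(f"rank {dist.get_rank()}: dp group size={dp_mesh.size()}, tp group size={tp_mesh.size()}")
    dist.destroy_process_group()
```

Each parallelism technique then takes its corresponding sub-mesh as its process group, which is what lets the degrees be stacked independently. Activation checkpointing is the odd one out: it trades recomputation for memory rather than adding a mesh axis.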

1. Introduction to Large Scale Distributed Training

GPU Hardware Overview

GPU: Graphics Processing Unit.

H100 Architecture: The NVIDIA H100 has its compute cores surrounded by 80 GB of HBM (high-bandwidth memory).
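As a quick way to see these numbers on real hardware, the following minimal sketch (assuming PyTorch with a visible NVIDIA GPU) queries the device properties; on an H100 the reported total memory is roughly 80 GB.

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)  # first visible GPU
    print(f"Device name:   {props.name}")
    print(f"SM count:      {props.multi_processor_count}")
    print(f"Total memory:  {props.total_memory / 1e9:.1f} GB")  # ~80 GB on an H100
else:
    print("No CUDA device visible.")
```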


H100 Memory Hierarchy