https://youtu.be/9MvD-XsowsE?si=GePSfotRWhTYfPIE
Large-Scale Distributed Training
Large-scale distributed training is the process of training neural networks on tens, hundreds, thousands, or even tens of thousands of devices concurrently, and it has become the new norm in deep learning.
Five degrees of parallelism exploited in large-scale distributed training:
- Data parallelism (DP, FSDP, HSDP); see the minimal DDP sketch after this list
- Context parallelism
- Pipeline parallelism
- Tensor parallelism
- Activation checkpointing (for memory saving)
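To make the first item concrete, here is a minimal data-parallel sketch using PyTorch DistributedDataParallel (an illustrative toy, not the training setup from the video): every rank holds a full replica of the model, processes its own slice of the batch, and gradients are all-reduced across ranks during the backward pass. The model, data, and hyperparameters are placeholders.

```python
# Minimal data-parallel (DDP) sketch. Launch with:
#   torchrun --nproc_per_node=8 ddp_sketch.py
# Model and data are toy placeholders, not from the video.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to(local_rank)
    model = DDP(model, device_ids=[local_rank])    # gradients get all-reduced across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(10):
        # Each rank sees a different shard of the global batch (here: random data).
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        loss.backward()                            # all-reduce overlaps with the backward pass
        opt.step()
        opt.zero_grad()
        if rank == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

FSDP, tensor, pipeline, and context parallelism build on the same process-group setup; they differ in what is sharded (parameters, layers, activations, or the sequence dimension) rather than in how the processes are launched.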
1. Introduction to Large Scale Distributed Training
GPU Hardware Overview
GPU: short for Graphics Processing Unit.
H100 Architecture: The NVIDIA H100 has a compute die surrounded by 80 GB of HBM (high-bandwidth memory).

- Memory Bandwidth: Data moves between HBM and the compute die at roughly 3 TB/s.
- L2 Cache: On the compute die sits a much smaller 50 MB L2 cache.
- Streaming Multiprocessors (SMs): The die contains 132 streaming multiprocessors (a quick way to check these numbers appears after this list).
- Binning Process: Hardware is binned to account for manufacturing defects, so the 144 SMs the die has in theory become 132 functional ones on the H100.
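The figures above are easy to sanity-check from Python. The snippet below (illustrative, not part of the video) reads the device properties PyTorch exposes and times a large device-to-device copy as a rough proxy for HBM bandwidth.

```python
# Quick, illustrative check of the numbers above. Requires PyTorch with a CUDA GPU visible.
import torch

props = torch.cuda.get_device_properties(0)
print(props.name)
print(f"HBM capacity: {props.total_memory / 2**30:.1f} GiB")   # ~80 GiB on an H100
print(f"SM count:     {props.multi_processor_count}")          # 132 on an H100 SXM

# Rough HBM bandwidth estimate: time a large device-to-device copy.
x = torch.empty(2**30, dtype=torch.uint8, device="cuda")       # 1 GiB source
y = torch.empty_like(x)                                        # 1 GiB destination
y.copy_(x)                                                     # warm-up
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y.copy_(x)
end.record()
torch.cuda.synchronize()
seconds = start.elapsed_time(end) / 1e3                        # elapsed_time returns ms
moved = 2 * x.numel()                                          # the copy reads and writes 1 GiB each
print(f"~{moved / seconds / 1e12:.2f} TB/s effective copy bandwidth")
```

A single timed copy understates peak bandwidth somewhat, but on an H100 it should land in the same few-TB/s range as the quoted figure.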
H100 Memory Hierarchy