https://youtu.be/9MvD-XsowsE?si=GePSfotRWhTYfPIE
Large-Scale Distributed Training
Large-scale distributed training is the process of training neural networks on tens, hundreds, thousands, or even tens of thousands of devices concurrently, and it has become the new norm in deep learning.
Five degrees of parallelism exploited in large-scale distributed training:
- Data parallelism (DP, FSDP, HSDP); see the minimal DDP sketch after this list
- Context parallelism
- Pipeline parallelism
- Tensor parallelism
- Activation checkpointing (for memory saving)
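To make the first item concrete, here is a minimal data-parallel sketch using PyTorch DistributedDataParallel (an illustrative toy, not the training setup from the video): every rank holds a full replica of the model, processes its own slice of the batch, and gradients are all-reduced across ranks during the backward pass. The model, data, and hyperparameters are placeholders.

```python
# Minimal data-parallel (DDP) sketch. Launch with:
#   torchrun --nproc_per_node=8 ddp_sketch.py
# Model and data are toy placeholders, not from the video.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to(local_rank)
    model = DDP(model, device_ids=[local_rank])    # gradients get all-reduced across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(10):
        # Each rank sees a different shard of the global batch (here: random data).
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        loss.backward()                            # all-reduce overlaps with the backward pass
        opt.step()
        opt.zero_grad()
        if rank == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

FSDP, tensor, pipeline, and context parallelism build on the same process-group setup; they differ in what is sharded (parameters, layers, activations, or the sequence dimension) rather than in how the processes are launched.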
1. Introduction to Large Scale Distributed Training
GPU Hardware Overview
GPU: short for Graphics Processing Unit.
H100 Architecture: The NVIDIA H100 has a compute die surrounded by 80 GB of HBM (high-bandwidth memory).

- Memory Bandwidth: Data moves between HBM and the compute die at roughly 3 TB/s.
- L2 Cache: On the compute die sits a much smaller 50 MB L2 cache.
- Streaming Multiprocessors (SMs): The die contains 132 streaming multiprocessors (a quick way to check these numbers appears after this list).
- Binning Process: Hardware is binned to account for manufacturing defects, so the 144 SMs the die has in theory become 132 functional ones on the H100.
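The figures above are easy to sanity-check from Python. The snippet below (illustrative, not part of the video) reads the device properties PyTorch exposes and times a large device-to-device copy as a rough proxy for HBM bandwidth.

```python
# Quick, illustrative check of the numbers above. Requires PyTorch with a CUDA GPU visible.
import torch

props = torch.cuda.get_device_properties(0)
print(props.name)
print(f"HBM capacity: {props.total_memory / 2**30:.1f} GiB")   # ~80 GiB on an H100
print(f"SM count:     {props.multi_processor_count}")          # 132 on an H100 SXM

# Rough HBM bandwidth estimate: time a large device-to-device copy.
x = torch.empty(2**30, dtype=torch.uint8, device="cuda")       # 1 GiB source
y = torch.empty_like(x)                                        # 1 GiB destination
y.copy_(x)                                                     # warm-up
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y.copy_(x)
end.record()
torch.cuda.synchronize()
seconds = start.elapsed_time(end) / 1e3                        # elapsed_time returns ms
moved = 2 * x.numel()                                          # the copy reads and writes 1 GiB each
print(f"~{moved / seconds / 1e12:.2f} TB/s effective copy bandwidth")
```

A single timed copy understates peak bandwidth somewhat, but on an H100 it should land in the same few-TB/s range as the quoted figure.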
H100 Memory Hierarchy