Pruning
Make neural networks smaller by removing synapses and neurons
Mostly done by removing unnecessary weights (setting them to zero), i.e., pruning the network
$$
\arg \min_{W_p} L(x; W_p) \\
\text{subject to } \|W_p\|_0 < N
$$
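Here $N$ is the budget on the number of nonzero weights. The objective itself doesn't say how to pick which weights to drop; a common heuristic is magnitude pruning: keep the $N$ largest-magnitude weights and zero out the rest. A minimal NumPy sketch (function name and shapes are illustrative):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, num_nonzero: int) -> np.ndarray:
    """Keep only the num_nonzero largest-magnitude weights, so ||W_p||_0 <= num_nonzero."""
    flat = np.abs(weights).ravel()
    if num_nonzero >= flat.size:
        return weights.copy()
    # Magnitude of the num_nonzero-th largest weight is the keep threshold.
    threshold = np.partition(flat, -num_nonzero)[-num_nonzero]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

W = np.random.randn(8, 8)
W_p = magnitude_prune(W, num_nonzero=16)
print(np.count_nonzero(W_p))  # ~16 (ties at the threshold may keep a few extra)
```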
Pruning Granularity
Fine-grained / Unstructured
- flexible pruning indices
- usually achieves a larger compression ratio, since we can flexibly find “redundant” weights
- can deliver a speedup on some specialized hardware, but generally not on GPUs (see the sketch below)
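A small sketch of what “flexible pruning indices” means in practice: any individual weight can be zeroed, so the surviving entries form an irregular index pattern that must be stored explicitly (the 75% sparsity ratio and shapes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))

sparsity = 0.75                               # prune 75% of the weights
threshold = np.quantile(np.abs(W), sparsity)  # per-tensor magnitude threshold
mask = np.abs(W) > threshold                  # any position can be pruned

rows, cols = np.nonzero(mask)                 # irregular list of surviving indices
print(list(zip(rows.tolist(), cols.tolist())))
# Dense GPU kernels still read the full matrix, so this irregular pattern
# saves compute only on hardware that can exploit unstructured sparsity.
```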
Coarse-grained / Structured / Pattern-based
- N:M sparsity: in each group of M contiguous elements, N of them are pruned (see the sketch after this list)
- supported by NVIDIA’s Ampere GPU architecture (2:4 sparsity), which delivers up to a 2x speedup
- usually maintains accuracy
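A sketch of the N:M pattern for the 2:4 case, using magnitude-based selection within each group (the metadata layout that Sparse Tensor Cores actually require is omitted; function name is illustrative):

```python
import numpy as np

def prune_n_of_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """In every group of m contiguous weights, zero out the n smallest-magnitude ones."""
    assert weights.size % m == 0
    groups = weights.reshape(-1, m)
    # Indices of the n smallest-magnitude elements within each group.
    prune_idx = np.argsort(np.abs(groups), axis=1)[:, :n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    return (groups * mask).reshape(weights.shape)

W = np.random.randn(2, 8)
W_24 = prune_n_of_m(W, n=2, m=4)  # every 4 contiguous weights keep exactly 2 nonzeros
```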