how does tl.program_id work?
→ Returns the id of the current program instance along the given axis.
pid) into the GPU hardware
how does tl.arange(0, BLOCK_SIZE) work?
→ arange works like NumPy's arange: it produces the integers from the start argument up to, but not including, the end argument (here 0 to 1023, since BLOCK_SIZE is 1024).
how does mask work?
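A rough way to see how program_id, arange, and mask fit together is to fake a single program instance in plain Python. This is only a sketch: it assumes the kernel follows the usual vector-add pattern (offsets = block_start + tl.arange(0, BLOCK_SIZE), mask = offsets < n_elements), and simulate_program is a made-up name, not a Triton function. The sizes are tiny so the result can be checked by hand.

```python
# Pure-Python stand-in for one Triton program instance (no GPU needed).
# simulate_program and its arguments are hypothetical names for this sketch.

def simulate_program(pid, x, y, output, n_elements, BLOCK_SIZE):
    # tl.program_id(axis=0) would return pid: which block of the vector this instance owns.
    block_start = pid * BLOCK_SIZE
    # block_start + tl.arange(0, BLOCK_SIZE): one offset per element in the block.
    offsets = [block_start + i for i in range(BLOCK_SIZE)]
    # mask = offsets < n_elements: False for offsets that fall past the end of the data.
    mask = [off < n_elements for off in offsets]
    # Masked load / add / store: only touch positions where the mask is True.
    for off, ok in zip(offsets, mask):
        if ok:
            output[off] = x[off] + y[off]
    return offsets, mask

n_elements, BLOCK_SIZE = 10, 4
x = list(range(n_elements))
y = [10 * v for v in x]
output = [None] * n_elements

num_programs = (n_elements + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceil(10 / 4) = 3
for pid in range(num_programs):  # on a GPU these instances run in parallel, not in a loop
    last_offsets, last_mask = simulate_program(pid, x, y, output, n_elements, BLOCK_SIZE)
```

The last instance (pid = 2) gets offsets 8, 9, 10, 11, but only 8 and 9 are inside the 10-element vector, so the mask switches off the last two lanes. That is the job of the mask: it lets n_elements be something other than a multiple of BLOCK_SIZE without reading or writing out of bounds.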
how does GPU Programming differ from CPU Programming?
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), ) → wtf
add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024) → how does this syntax work?
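One way to demystify both lines is a toy model in plain Python. The idea: kernel[grid] is ordinary indexing (__getitem__) that returns a launcher, and a callable grid is handed the meta-parameters so it can work out how many program instances to start. ToyKernel, record_pid, and cdiv below are made-up names for this sketch, and the real Triton launcher does far more (compilation, caching, the actual GPU launch). The value of n_elements is arbitrary.

```python
def cdiv(a, b):
    # Ceiling division, the same arithmetic triton.cdiv performs.
    return (a + b - 1) // b

class ToyKernel:
    """Hypothetical stand-in for a @triton.jit function; not Triton's real class."""

    def __init__(self, fn):
        self.fn = fn

    def __getitem__(self, grid):
        # kernel[grid] lands here and returns a launcher that remembers the grid.
        def launch(*args, **meta):
            # A callable grid receives the meta-parameters (e.g. BLOCK_SIZE) as a dict.
            dims = grid(meta) if callable(grid) else grid
            for pid in range(dims[0]):  # this sketch handles a 1D grid only
                self.fn(pid, *args, **meta)
            return dims
        return launch

launched = []

def record_pid(pid, n_elements, BLOCK_SIZE):
    # Stand-in kernel body: just record which program ids were started.
    launched.append(pid)

toy_kernel = ToyKernel(record_pid)
n_elements = 5000  # arbitrary size for illustration
grid = lambda meta: (cdiv(n_elements, meta['BLOCK_SIZE']), )
dims = toy_kernel[grid](n_elements, BLOCK_SIZE=1024)
```

With 5000 elements and BLOCK_SIZE=1024, the lambda returns (5,): four full blocks plus one partial block. The trailing comma makes it a one-element tuple, i.e. a 1D grid. Writing the grid as a lambda instead of a fixed tuple means the instance count follows BLOCK_SIZE automatically if that value changes.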
Before understanding the code, we need to understand the basic architecture of a GPU and how the code maps onto it.

Grid = Collection of Blocks
grids : 1, 2, or 3 dimensions
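To make "collection of blocks" concrete, here is a small plain-Python sketch that lists the block ids a grid of a given shape would contain. program_ids is a made-up helper, and the listing order is only for readability; on real hardware the order in which blocks run is not something to rely on.

```python
from itertools import product

def program_ids(grid):
    # One id tuple per block; component k is what program_id(axis=k) would return.
    return list(product(*(range(n) for n in grid)))

ids_1d = program_ids((3,))       # 1D grid with 3 blocks
ids_2d = program_ids((2, 3))     # 2D grid with 2 * 3 = 6 blocks
ids_3d = program_ids((2, 2, 2))  # 3D grid with 2 * 2 * 2 = 8 blocks
```

A 1D grid is enough for a flat vector add; 2D or 3D grids are mainly a convenience when the data itself is 2D or 3D (for example, tiles of a matrix).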