A kernel is the unit of CUDA code that programmers typically write and compose, akin to a procedure or function in languages targeting CPUs.
A kernel is invoked or launched from a host (CPU)
kernel is a function in host for executing commands in device
collection of all threads executing a kernel is organized as a kernel grid (or, thread block grid or grid of threads)
kernel grid executes across multimple SMs, matching level of memory hiearchy is the global memory.
grid(thread block grid)
threadIdx.x and threadIdx.y
thread indices of a thread
the original loop $i$ and $j$ becomes threadIdx.x and threadIdx.y