
Pay attention to the left image!
The thread hierarchy is a key abstraction of the CUDA programming model, alongside the memory hierarchy.
A thread is the lowest level of the CUDA thread hierarchy.
It executes a stream of instructions.
The hardware resources that carry out arithmetic and logic instructions are called cores (or pipes). Note that each core runs a single thread at a time.
The warp scheduler selects which warp (a group of 32 threads) issues its next instruction; the threads of that warp then execute on the cores.
HW: threads execute on individual cores.
FYI: the term CTA (Cooperative Thread Array) is used in the context of PTX/SASS, but it means essentially the same thing as a block or thread block.
Each thread has a unique index-based identifier within its thread block. This makes it easier to assign work to individual threads.
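A minimal sketch of how the per-thread index is typically turned into a work assignment. The kernel name `write_indices`, the helper `run_index_demo`, and the sizes are made up for illustration; only `threadIdx`, `blockIdx`, `blockDim` and the runtime calls come from CUDA itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread records its own global index.
__global__ void write_indices(int *out, int n) {
    // threadIdx.x is unique within the block; adding the block offset
    // makes the index unique across the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may contain spare threads
        out[i] = i;
    }
}

// Returns true if every element ended up holding its own index.
bool run_index_demo() {
    const int n = 1000;
    const int threads_per_block = 256;
    // Round up so that all n elements are covered (4 blocks here).
    const int blocks = (n + threads_per_block - 1) / threads_per_block;

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;

    write_indices<<<blocks, threads_per_block>>>(d_out, n);

    int h_out[n];
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != i) return false;
    }
    return true;
}

int main() {
    bool ok = run_index_demo();
    std::printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

The bounds check matters because the grid is rounded up to whole blocks, so some threads in the last block have no element to work on.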
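All threads within a block are scheduled together onto the same SM; that SM's warp schedulers then issue their instructions warp by warp. Because the threads run on the same SM, they share its on-chip memory (on recent architectures, the same storage that backs the L1 cache), so they can coordinate through shared memory and synchronize with barriers.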
WARNING: "SM" stands for Streaming Multiprocessor, NOT shared memory. Don't confuse the two terms (shared memory is on-chip memory that lives inside each SM).
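A minimal sketch of threads in one block coordinating through shared memory and a barrier, as described above. The kernel `reverse_in_block`, the helper `run_reverse_demo`, and the block size of 64 are assumptions for illustration; `__shared__` and `__syncthreads()` are the standard CUDA constructs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int BLOCK = 64;  // illustrative block size

// Hypothetical kernel: reverses BLOCK elements using a single thread block.
__global__ void reverse_in_block(int *data) {
    __shared__ int tile[BLOCK];  // visible to every thread in this block
    int t = threadIdx.x;

    tile[t] = data[t];  // each thread stages one element
    // Barrier: no thread continues until all threads have written to tile.
    __syncthreads();
    // Now it is safe to read an element written by a different thread.
    data[t] = tile[BLOCK - 1 - t];
}

// Returns true if the array comes back reversed.
bool run_reverse_demo() {
    int h[BLOCK];
    for (int i = 0; i < BLOCK; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return false;
    if (cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d);
        return false;
    }

    reverse_in_block<<<1, BLOCK>>>(d);  // one block, so one shared tile

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < BLOCK; ++i) {
        if (h[i] != BLOCK - 1 - i) return false;
    }
    return true;
}

int main() {
    bool ok = run_reverse_demo();
    std::printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Without the `__syncthreads()` call, a thread could read a slot of `tile` before the thread responsible for it has written it. The barrier only covers threads of the same block, which is why this kind of coordination stops at the block boundary.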