
Thread (GPU)

In GPU programming, a thread is the smallest unit of execution within a kernel function. Threads are grouped into thread blocks, which are further organized into a grid.
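For illustration, a minimal CUDA sketch (the kernel name and launch configuration here are arbitrary): the launch creates a grid of two blocks with four threads each, and every one of the eight threads executes the kernel body once.

```cuda
#include <cstdio>

// Each thread runs the body of this kernel exactly once.
__global__ void hello_kernel() {
    printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

int main() {
    hello_kernel<<<2, 4>>>();   // grid of 2 blocks, 4 threads per block
    cudaDeviceSynchronize();    // wait for the kernel to finish
    return 0;
}
```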

The programmer specifies the number of thread blocks in a grid and how they are arranged across one, two, or three dimensions; each block is assigned a unique block index that determines its position within the grid. Likewise, the programmer specifies the number of threads per thread block and how they are arranged across one, two, or three dimensions; each thread is assigned a unique thread index that determines its position within the block.
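Both configurations are set at launch time. As a sketch (the kernel name and sizes are illustrative), the following launches a two-dimensional grid of two-dimensional blocks, and each thread reads the built-in blockIdx and threadIdx variables to locate itself:

```cuda
// Each thread derives its global 2D coordinates from its block index
// (blockIdx), the block dimensions (blockDim), and its thread index
// (threadIdx), all of which are built-in CUDA variables.
__global__ void index_kernel(float* data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;   // operate on element (x, y)
}

int main() {
    int width = 1024, height = 768;
    float* data;
    cudaMalloc(&data, width * height * sizeof(float));

    dim3 block(16, 16);   // 256 threads per block, arranged 16 x 16
    dim3 grid((width  + block.x - 1) / block.x,   // enough blocks to cover
              (height + block.y - 1) / block.y);  // the full 1024 x 768 range
    index_kernel<<<grid, block>>>(data, width, height);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```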

The GPU assigns each thread block within the grid to a streaming multiprocessor (SM) for execution. The SM partitions the threads of a block into fixed-size subsets called warps: a warp holds 32 threads on NVIDIA architectures, while the equivalent grouping on AMD hardware (the wavefront) holds 32 or 64 threads depending on the architecture. The SM's warp scheduler manages the execution of warps on the SM's cores.
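On NVIDIA hardware a thread can compute which warp and lane it occupies from its thread index, as in this sketch (the launch size is arbitrary, and a one-dimensional block is assumed):

```cuda
#include <cstdio>

__global__ void warp_info_kernel() {
    int tid  = threadIdx.x;        // flat index within the block (1D launch assumed)
    int warp = tid / warpSize;     // which warp of the block this thread belongs to
    int lane = tid % warpSize;     // position within that warp; warpSize is a
                                   // built-in variable, 32 on current NVIDIA GPUs
    if (lane == 0)                 // one thread per warp reports
        printf("block %d: warp %d begins at thread %d\n", blockIdx.x, warp, tid);
}

int main() {
    warp_info_kernel<<<1, 64>>>(); // 64 threads -> two warps of 32
    cudaDeviceSynchronize();
    return 0;
}
```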

The SM allocates a set of registers for each thread to store and process values private to that thread. The registers are associated with that thread throughout its lifetime, even if the thread is not currently executing on the SM's cores (for example, if it is blocked waiting for data from memory). Each thread also has access to local memory to store statically allocated arrays, spilled registers, and other elements of the thread's call stack.
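Whether a given variable lives in registers or in local memory is ultimately the compiler's decision, but the pattern in this sketch is typical (the kernel and variable names are illustrative): a private scalar is normally register-allocated, while a per-thread array indexed with a value unknown at compile time is generally placed in local memory.

```cuda
__global__ void locals_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;   // a private scalar like this normally lives in a register

    float window[8];    // a per-thread array; because it is indexed with a runtime
                        // value below, the compiler typically places it in local memory
    for (int k = 0; k < 8; ++k) {
        window[k] = in[(i + k) % n];
        acc += window[k];
    }
    out[i] = acc + window[i % 8];   // runtime index into the per-thread array
}

int main() {
    int n = 1024;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    locals_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```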

Threads within a block can share data through shared memory and synchronize using barrier primitives (for example, __syncthreads() in CUDA), but they cannot directly communicate with threads in other blocks.
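A sketch of both mechanisms in CUDA (the block size and kernel name are illustrative): each block stages its slice of the input in shared memory, and __syncthreads() acts as a barrier so no thread reads a neighbor's element before it has been written.

```cuda
#define BLOCK 256

// Each block computes a partial sum of BLOCK input elements.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[BLOCK];           // visible to every thread in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       // wait until all threads have written

    // Tree reduction within the block; a barrier separates each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];          // one partial sum per block
}
```

Combining the per-block partial sums then takes a second kernel launch (or atomic operations) precisely because blocks cannot communicate with one another directly.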