Warp
In GPU programming, a warp is a subset of threads from a thread block that execute together in lockstep. When a GPU assigns a thread block to execute on a streaming multiprocessor (SM), the SM divides the thread block into warps of 32 or 64 threads, with the exact size depending on the GPU architecture.
If a thread block contains a number of threads not evenly divisible by the warp size, the SM creates a partially filled final warp that still consumes the full warp's resources. For example, if a thread block has 100 threads and the warp size is 32, the SM creates:
- 3 full warps of 32 threads each (96 threads total)
- 1 partial warp with only 4 active threads, still occupying a full warp's worth of resources (32 thread slots)
The SM effectively disables the unused thread slots in partial warps, but those slots still consume hardware resources. For this reason, developers should generally make thread block sizes a multiple of the warp size to optimize resource usage.
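As a concrete sketch of this sizing arithmetic (the device index and the 100-thread block are illustrative choices, and the warp size is queried from the device rather than assumed), the CUDA runtime can report the warp size and a block size can be rounded up to a multiple of it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query the warp size reported by device 0 (32 on current NVIDIA GPUs).
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int warpSz = prop.warpSize;

    // A 100-thread block is split into ceil(100 / 32) = 4 warps,
    // so it occupies 128 thread slots even though only 100 are active.
    int blockThreads = 100;
    int warpsPerBlock = (blockThreads + warpSz - 1) / warpSz;
    printf("%d threads -> %d warps (%d thread slots)\n",
           blockThreads, warpsPerBlock, warpsPerBlock * warpSz);

    // Rounding the block size up to a multiple of the warp size
    // avoids the partially filled final warp.
    int roundedBlock = warpsPerBlock * warpSz;  // 128
    printf("Rounded block size: %d\n", roundedBlock);
    return 0;
}
```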
Each thread in a warp executes the same instruction at the same time on different data, following the single instruction, multiple threads (SIMT) execution model. If threads within a warp take different execution paths (called warp divergence), the warp serially executes each branch path taken, disabling threads that are not on that path. This means that optimal performance is achieved when all threads in a warp follow the same execution path.
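A minimal sketch of warp divergence, assuming 1D thread blocks and illustrative kernel names: in the first kernel the branch splits each warp's lanes, so the warp runs both paths serially; in the second, the condition is uniform across every lane of a warp, so no serialization occurs.

```cuda
// Divergent: lanes within the same warp take different branches,
// so the warp executes both paths one after the other.
__global__ void divergentKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        out[i] = 2.0f * i;   // even lanes run this path first...
    } else {
        out[i] = 0.5f * i;   // ...then odd lanes run this path
    }
}

// Uniform: the condition evaluates to the same value for all 32 lanes
// of a warp, so every warp follows a single path.
__global__ void uniformKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0) {  // constant within a warp
        out[i] = 2.0f * i;
    } else {
        out[i] = 0.5f * i;
    }
}
```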
An SM can actively manage multiple warps from different thread blocks simultaneously, helping keep execution units busy. For example, the warp scheduler can quickly switch to another ready warp if the current warp's threads must wait for memory access.
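One way to see how many warps a given kernel can keep resident per SM (and therefore how much the scheduler has to switch between) is the CUDA occupancy API; the kernel body and the block size below are placeholder choices for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)i;
}

int main() {
    // Ask the runtime how many blocks of this kernel can be resident on one SM
    // at the chosen block size; each resident block contributes its warps to
    // the pool the warp scheduler can switch between.
    int blockSize = 128;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                  blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int warpsPerSM = blocksPerSM * (blockSize / prop.warpSize);
    printf("%d resident blocks/SM -> %d resident warps/SM\n",
           blocksPerSM, warpsPerSM);
    return 0;
}
```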
Warps deliver several key performance advantages:
- The hardware needs to manage only warps instead of individual threads, reducing scheduling overhead
- Threads in a warp can access contiguous memory locations efficiently through memory coalescing
- The hardware automatically synchronizes threads within a warp, eliminating the need for explicit synchronization (see the warp-sum sketch after this list)
- The warp scheduler can hide memory latency by switching between warps, maximizing compute resource utilization
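As an illustration of warp-level synchronization and communication, a common warp-sum sketch uses CUDA's shuffle intrinsics (the helper name is illustrative, and a 32-thread warp is assumed):

```cuda
// Warp-level sum reduction using shuffle intrinsics. Lanes exchange values
// directly through registers; the _sync intrinsics synchronize the lanes
// named by the mask (here all 32), so no shared memory or __syncthreads()
// is needed for this intra-warp step.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;  // lane 0 ends up holding the sum of all 32 lanes' inputs
}
```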