Grid
A grid is the top-level organizational structure of the threads executing a kernel function on a GPU. A grid consists of multiple thread blocks, which are further divided into individual threads that execute the kernel function concurrently.
The division of a grid into thread blocks serves multiple crucial purposes:
- First, it breaks the overall workload managed by the grid into smaller, more manageable portions that can be processed independently. This division allows for better resource utilization and scheduling flexibility across the multiple streaming multiprocessors (SMs) in the GPU.
- Second, thread blocks provide a scope for threads to collaborate through shared memory and synchronization primitives, enabling efficient parallel algorithms and data-sharing patterns (see the sketch after this list).
- Finally, thread blocks help with scalability by allowing the same program to run efficiently across different GPU architectures, as the hardware can automatically distribute blocks based on the available resources.
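To illustrate the second point, here is a minimal sketch of block-scoped cooperation. The kernel name, the block size of 256 threads, and the reduction pattern are all illustrative assumptions, not part of the original text: each block's threads share a tile of shared memory and synchronize between steps to produce one partial sum per block.

```cuda
// Illustrative kernel: each block of 256 threads sums its slice of the input
// using shared memory and barriers, producing one partial sum per block.
// Assumes the kernel is launched with blockDim.x == 256.
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float tile[256];            // visible only to threads in this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       // wait until every thread has stored its value

    // Tree reduction within the block; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        blockSums[blockIdx.x] = tile[0];   // one partial sum per block
    }
}
```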
The programmer specifies the number of thread blocks in a grid and how they are arranged across one, two, or three dimensions. Typically, the programmer determines the dimensions of the grid based on the dimensionality of the data to process. For example, a programmer might choose a 1-dimensional grid for processing large vectors, a 2-dimensional grid for processing matrices, and a 3-dimensional grid for processing the frames of a video. Each block within the grid is assigned a unique block index that determines its position within the grid.
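As a rough sketch of how this looks in CUDA C++ (the kernel name, matrix dimensions, and block shape are illustrative assumptions), the host code below configures a two-dimensional grid sized to cover a matrix, and the kernel records each block's unique linearized index:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: thread (0, 0) of each block records the block's
// unique linearized index. gridDim holds the grid's shape in blocks.
__global__ void recordBlockIndex(int *out) {
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        int linear = blockIdx.y * gridDim.x + blockIdx.x;
        out[linear] = linear;
    }
}

int main() {
    // A 2D grid sized to cover a 1024 x 2048 matrix with 16 x 16 blocks.
    int rows = 1024, cols = 2048;
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x,   // blocks along x cover the columns
              (rows + block.y - 1) / block.y);  // blocks along y cover the rows

    int numBlocks = grid.x * grid.y;
    int *out;
    cudaMalloc(&out, numBlocks * sizeof(int));
    recordBlockIndex<<<grid, block>>>(out);
    cudaDeviceSynchronize();

    printf("launched %d blocks\n", numBlocks);
    cudaFree(out);
    return 0;
}
```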
Similarly, the programmer also specifies the number of threads per thread block and how they are arranged across one, two, or three dimensions. Each thread within a block is assigned a unique thread index that determines its position within the block. The combination of block index and thread index uniquely identifies the position of a thread within the overall grid.
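A minimal kernel sketch makes that combination concrete (the kernel name and the doubling operation are illustrative): each thread combines its block index with its thread index to find the one element it is responsible for.

```cuda
// Illustrative kernel: blockIdx, blockDim, and threadIdx combine into a
// global index that is unique across the whole grid.
__global__ void doubleElements(float *data, int n) {
    int global = blockIdx.x * blockDim.x + threadIdx.x;
    if (global < n) {          // the last block may extend past the end of the data
        data[global] *= 2.0f;
    }
}
```

For example, with 256 threads per block, the thread with blockIdx.x == 2 and threadIdx.x == 5 handles element 2 * 256 + 5 = 517.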