Streaming multiprocessor
The basic building block of a GPU is called a streaming multiprocessor (SM) on NVIDIA GPUs or a compute unit (CU) on AMD GPUs (the two are essentially the same idea, and we'll refer to both as SMs). SMs sit between the high-level GPU control logic and the individual execution units, acting as self-contained processing factories that can operate independently and in parallel.
Multiple SMs are arranged on a single GPU chip, and each SM can execute several thread blocks concurrently. The GPU's global scheduler assigns thread blocks to individual SMs, while the memory controllers manage data flow between the SMs and the rest of the memory hierarchy (L2 cache, global memory, etc.).
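The unit of work the scheduler hands to an SM is the thread block. The CUDA sketch below is illustrative (the kernel name and sizes are made up), but it shows the pattern that makes this scheduling work: a launch consists of many independent blocks, so the hardware can spread them across however many SMs the chip happens to have.

```cuda
#include <cuda_runtime.h>

// A trivial kernel: each thread writes its global index.
// Each *block* of this launch is assigned to some SM by the GPU's
// global work scheduler; blocks are independent, so they can run
// on any SM, in any order.
__global__ void fill_indices(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    const int n = 1 << 20;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    // Launch far more blocks than there are SMs; the scheduler
    // streams them onto SMs as execution resources free up.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    fill_indices<<<blocks, threadsPerBlock>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```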
The number of SMs in a GPU varies widely with the model and intended use case, from a handful in entry-level GPUs to well over a hundred in high-end datacenter parts (an NVIDIA H100, for example, has 132 SMs). Because thread blocks are independent, the same code scales across this whole range: more SMs simply means more blocks executing at once.
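You can query the SM count of a device at runtime with the standard CUDA runtime API; a minimal sketch follows (the reported numbers depend entirely on your hardware).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // multiProcessorCount is the number of SMs on the device.
        printf("Device %d (%s): %d SMs\n",
               d, prop.name, prop.multiProcessorCount);
    }
    return 0;
}
```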
Each SM contains several essential components:
- CUDA Cores (NVIDIA)/Stream Processors (AMD): These are the basic arithmetic logic units (ALUs) that perform integer and floating-point calculations. A single SM contains on the order of a hundred of these cores (for example, 128 FP32 cores per SM on recent NVIDIA architectures).
- Tensor Cores (NVIDIA)/Matrix Cores (AMD): Specialized units built around matrix multiply-accumulate operations, which underlie the matrix multiplications and convolutions at the heart of deep learning workloads.
- Special Function Units (SFUs): Handle complex mathematical operations like trigonometry, square roots, and exponential functions.
- Register Files: Ultra-fast storage that holds intermediate results and thread-specific data. Modern SMs can have hundreds of kilobytes of register space shared among active threads.
- Shared Memory/L1 Cache: A programmable, low-latency memory space that enables data sharing between the threads of a block (see the sketch after this list). On recent architectures this is a single physical memory whose capacity is split, configurably, between software-managed shared memory and hardware-managed L1 cache.
- Load/Store Units: Manage data movement between different memory spaces, handling memory access requests from threads.
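As an illustration of the shared memory item above, the sketch below has each block stage its slice of the input in `__shared__` storage and reduce it cooperatively. The kernel and sizes are made up for the example, but `__shared__`, `__syncthreads()`, and per-block scoping are standard CUDA.

```cuda
#include <cuda_runtime.h>

// Per-block sum reduction using the SM's shared memory.
// Each block loads its slice of the input into low-latency shared
// memory, then its threads cooperate (synchronizing with
// __syncthreads) to reduce the slice to a single value.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];           // lives in the SM's shared memory

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // all loads visible to the block

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 16, threads = 256;   // threads must match tile[] size
    const int blocks = (n + threads - 1) / threads;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    // (fill d_in with real data here before launching)

    block_sum<<<blocks, threads>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```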