# Glossary

> Glossary of AI, GPU, and systems programming terms.

This file contains all documentation content in a single document following the llmstxt.org standard.

## Attention mask

An attention mask specifies which tokens in a sequence a model can attend to during [attention](attention.mdx) score computation. This prevents the model from attending to tokens it should ignore. For example, when sequences in a batch are padded to the same length, an attention mask prevents the model from attending to [padding tokens](padding-tokens.mdx), which carry no meaningful information.

## Causal mask

In [transformer](transformer.mdx) models, self-attention (a specific form of attention where a sequence attends to itself) allows every token to attend to all other tokens simultaneously, with no inherent notion of order. [Autoregressive](autoregression.mdx) language models, however, must generate tokens sequentially, meaning each token is conditioned only on preceding tokens. The *causal mask* (also called a *look-ahead mask*) resolves this tension by preventing the self-attention layer from attending to future tokens, ensuring that each token's representation incorporates information only from tokens at previous positions.

Concretely, the causal mask is a matrix that sets attention scores to negative infinity for future positions. After the softmax operation, these negative-infinity values become zero, blocking information flow from later tokens to earlier ones.

The causal mask is essential during training, where the model processes entire sequences in parallel and must be prevented from attending to tokens it should be predicting. The same constraint applies during inference at the [context encoding](context-encoding.mdx) (also called prefill) phase, where all input tokens are likewise processed in parallel.
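As a rough illustration of the masking step, the following pure-Python sketch builds a causal mask for a four-token sequence and applies it before softmax. The helper names (`causal_mask`, `softmax`) and the uniform raw scores are assumptions made for this example, not part of any particular library.

```python
import math


def causal_mask(seq_len):
    """Return a seq_len x seq_len mask: 0.0 where attention is allowed,
    -inf where the key position lies in the future of the query position."""
    return [
        [0.0 if key <= query else -math.inf for key in range(seq_len)]
        for query in range(seq_len)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw attention scores: every token scores 1.0 against every other.
scores = [[1.0] * 4 for _ in range(4)]
mask = causal_mask(4)

# Adding the mask pushes future positions to -inf; softmax then zeroes them.
masked = [[s + m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(scores, mask)]
weights = [softmax(row) for row in masked]

for row in weights:
    print(row)
```

Each row of `weights` spreads its probability only over the current and earlier positions, so the first token attends only to itself and the last token attends to all four.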
Without the causal mask, information from later tokens would corrupt the representations of earlier tokens, producing attention scores that differ from what the model learned during training. Note that during the decode phase of inference, the causal mask is effectively redundant: the model generates one token at a time and attends only to the [KV cache](kv-cache.mdx) of previously-seen tokens, so there are no future tokens to mask. --- ## Attention Attention is a mechanism used in AI models such as [transformers](transformer.mdx) that enables the model to assign different levels of importance to different tokens (such as words or pixels) in an input sequence. Unlike traditional architectures that treat all input data equally, attention allows the model to capture relationships between tokens that may be far apart in a sequence. This enables large language models (LLMs) to generate coherent, contextually relevant output. Attention was introduced and refined in the papers [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) (Bahdanau et al., 2014) and [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025) (Luong et al., 2015). ## How attention works Attention operates on three vectors: a **query** (Q), a **key** (K), and a **value** (V). The query comes from the token (or sequence) that is looking for information, while the keys and values come from the tokens being looked at. These two sources can be different: for example, in machine translation, a decoder token might query the keys and values of an encoder's output to decide which input words are most relevant. This is sometimes called cross-attention. 
Regardless of where the queries, keys, and values originate, the attention operation follows the same steps: it compares each query against every key to produce a matrix of raw attention scores, normalizes the scores (via softmax) into a probability distribution, and uses those probabilities to compute a weighted combination of the value vectors. The result is a new [embedding](embedding.mdx) for each query token that encodes the information it gathered from the tokens it attended to. ## Self-attention The most well-known form of attention is self-attention, used in [transformer](transformer.mdx) models. In self-attention, the queries, keys, and values all come from the same sequence, which means every token attends to every other token **in its own input**. This allows the model to build a rich understanding of context by evaluating how each token relates to all others, regardless of their distance in the sequence. Because self-attention recomputes scores for every token in the sequence, doing so from scratch at each generation step would be expensive. To avoid this, the model saves the calculated keys and values into the [KV cache](kv-cache.mdx) so they can be reused during the next [autoregression](autoregression.mdx) cycle. ## Scaled dot-product attention The diagram below shows scaled dot-product attention, which is the standard implementation of the attention operation used in transformer models:
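The operation can also be sketched in code. The following is a minimal single-head version in pure Python, assuming `Q`, `K`, and `V` are lists of `d`-element rows. The function name and the toy inputs are hypothetical, and a production implementation would instead use batched matrix multiplications on the GPU.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention over lists of row vectors.

    Q has one row per query token; K and V have one row per attended token.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)  # the "scaled" part of scaled dot-product
    output = []
    for q in Q:
        # Compare this query against every key to get raw scores.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Normalize the scores into a probability distribution.
        weights = softmax(scores)
        # Blend the value vectors using those probabilities.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


# Toy inputs: the first query matches no key in particular,
# the second query strongly matches the first key.
Q = [[0.0, 0.0], [10.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

output = scaled_dot_product_attention(Q, K, V)
print(output)
```

The first output row is the plain average of the two value vectors, because a zero query scores every key equally. The second row lands close to the first value vector, because that query aligns with the first key.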
The Q, K, and V matrices each have shape `[batchSize, numHeads, S, d]`, where:

- `S` is the sequence length (which can be as large as `O(10^3)` to `O(10^4)`)
- `d` is the size per attention head in multi-head attention (usually a power of 2 like 64 or 128, and smaller than `S`).

These matrices go through the following operations:

1. `Q x Transpose(K)`: Batched matrix multiplication (`bmm`) that produces a matrix of raw attention scores, one for every pair of tokens. The scores are scaled by `1/sqrt(d)` before the softmax, which is what "scaled" refers to.
2. `softmax`: Conversion of the raw scores into a probability distribution so they sum to 1 for each token.
3. `softmax(Q x Transpose(K)) x V`: Another `bmm` that uses the normalized scores to blend every token's value vector into a single output embedding per token.

A limitation of this implementation is that it materializes an intermediate matrix of shape `[batchSize, numHeads, S, S]`, introducing `O(S^2)` memory allocation and traffic.

---

## Autoregression

Autoregression is a process by which an AI model iteratively predicts future values based on previous values in a sequence, using its own output as input to itself. Because each prediction depends on prior context, the process is sequential, which limits parallelization.

Autoregression is a standard procedure in [transformer](transformer.mdx) models such as large language models (LLMs) and other models that perform time-series forecasting. This autoregressive process explains why AI chat bots like ChatGPT and Gemini stream the output one word at a time, although they sometimes run so fast that they appear to produce more than one word at a time.

---

## Batching

Batching is the process of combining multiple inference requests into a single forward pass through the model, thus executing multiple requests simultaneously and improving computational efficiency.
To account for requests with varying sequence lengths, it's common to add techniques such as [padding](padding-tokens.mdx) (to standardize lengths) or [ragged tensors](ragged-tensors.mdx) (to handle variable lengths directly). Batch sizes can be either static or dynamic. Whereas static batching uses a fixed batch size and thus waits until the system receives a specific number of inference requests before sending them into the model, dynamic batching uses a flexible batch size. For example, dynamic batching may send a batch of requests to the model as soon as the batch either reaches a certain number of requests (batch size limit) or it reaches a timeout threshold. Dynamic batching can get a lot more complicated than that with additional tricks that keep GPUs busy instead of waiting for one batch to finish before starting another. One such strategy for large language models (LLMs) is [continuous batching](continuous-batching.mdx). --- ## Context encoding Context encoding is the first phase of inference in a [transformer model](transformer.mdx) (also known as the "prefill" stage). During context encoding, the model processes the [tokenized](tokenization.mdx) input sequence in parallel, computing [attention](attention.mdx) scores for every token. As a byproduct of this computation, the model populates the [KV cache](kv-cache.mdx) with the key and value vectors for each input token, so they don't need to be recomputed during subsequent token generation. After context encoding, the model enters the [autoregressive](autoregression.mdx) decode phase, generating one token at a time. Each new token only needs to compute attention against the existing KV cache rather than reprocessing the entire input, which is what makes generation after the first token comparatively fast. Context encoding is typically the most computationally expensive phase because it must process every input token at once. 
Although this work can be parallelized across thousands of GPU threads, it is still the primary contributor to time-to-first-token (TTFT) latency. --- ## Continuous batching Continuous batching is a [batching](batching.mdx) technique that can continuously dispatch inference requests to the GPU for [token generation](token-generation.mdx) and dramatically improve GPU utilization. Continuous batching can start executing a new batch even before the previous batch finishes its pass through the model, because this batching technique schedules new processing at the "token level." That is, because large language models (LLMs) generate responses one token at a time, there is a repeated cycle during inference (the token generation phase) in which a new batch can jump in to utilize the GPU, even before a previous batch finishes its pass through the model. That's what it means to operate at the "token level"—the batch scheduler focuses on keeping the GPU busy with token generation at all times, instead of waiting for the previous batch to finish its complete forward pass. This is sometimes called "in-flight batching" in cases where context encoding and token generation requests are combined into the same batch. --- ## Disaggregated inference Disaggregated inference is a serving architecture pattern for large language models (LLMs) in which the two main phases of inference, prefill and decode, are executed on separate hardware resources. You might also see this technique called disaggregated prefill or disaggregated serving. All of these names describe the same core idea: separating the model's inference phases and providing each phase with dedicated resources optimized for its specific computational characteristics. ## Prefill and decode phases LLM inference involves two distinct phases, each with different performance characteristics. **Prefill** (also known as context encoding) is the initial phase where the model processes the entire input prompt. 
The model performs a full forward pass to initialize its [KV cache](kv-cache.mdx) and predict the first output token. This phase is compute-intensive, especially for long prompts, because it involves large-scale matrix operations that demand high floating-point throughput. The key performance metric for this phase is Time-to-First-Token (TTFT): the duration from receiving the input prompt to producing the first output token.

**Decode** (also known as token generation) is the phase where the model generates output tokens one at a time, using the KV cache initialized during prefill. By leveraging this cache, the model avoids reprocessing the full input each time. The decode phase is less compute-intensive per token but becomes memory-bound, relying heavily on efficient access to cached data. The key performance metric here is Inter-Token Latency (ITL): the time taken to generate each subsequent token after the first.

## How disaggregated inference works
A simplified illustration of the separate prefill and decode nodes used in a disaggregated inference serving architecture.
In a disaggregated setup, prefill and decode workloads are routed to different GPUs or GPU nodes. This allows each phase to be optimized independently:

- **Prefill nodes** are configured with hardware that prioritizes high compute throughput, suited for the intensive matrix operations required to process long input prompts.
- **Decode nodes** are configured with hardware that prioritizes fast memory access, better suited for the sequential, cache-dependent nature of token generation.

This separation reduces contention between compute-bound and memory-bound tasks, improves GPU utilization, and allows prefill and decode capacity to be scaled independently.

## When to use disaggregated inference

Disaggregated inference is most valuable when minimizing latency is a priority. Because the prefill stage is compute-intensive and the decode stage is memory-bound, isolating the two stages and allocating them to different hardware reduces resource contention and helps achieve both faster TTFT and smoother token streaming. It is especially effective for improving tail latency (such as P95, the latency within which 95% of requests complete, so that only the slowest 5% take longer).

Disaggregation also enables more granular parallelism strategies: you can scale prefill and decode nodes independently as demand changes, improving GPU utilization and overall efficiency without over-provisioning capacity just to handle peak workloads.

Disaggregated inference is also well-suited to heterogeneous or resource-constrained environments where you need to match each phase with hardware that fits its specific demands.

---

## Embedding

An embedding (also known as a "vector embedding") is a numerical representation of information in a high-dimensional vector space. For example, a token embedding (or word embedding) encodes the meaning of words for use in large language models (LLMs).
Because artificial neural networks (AI models) are a sequence of mathematical operations, they require numerical structures as input. Vector embeddings are numerical structures that provide a way to express a wide range of complex concepts. They can be used to capture information about all sorts of things, including words, groups of words, sounds, images, and more. For example, [tokenizing](tokenization.mdx) a word like "bank" into a simple number can't encode the different meanings in "bank loan" and "river bank." By converting the token into a high-dimensional vector, we can encode (or "embed") a variety of word meanings that help the model understand word relationships using a notion of closeness along various vector dimensions (expressed through [euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance)). In this way, when a model encounters the embedding for the word "bank," it can recognize the relationship it has with nearby words such as "loan" or "river," based on the closeness they each have to each other on different vector dimensions (perhaps a "finance" dimension vs a "geography" dimension that were learned during training). Although word embeddings are a type of static embedding that encode the meaning of individual words as input to an LLM, an LLM also builds its own embeddings that are hidden inside the model. For example, as an LLM tries to understand the relationship between each word from an input sequence, it compresses more information into each token embedding based on the attention scores computed in the [self-attention layer](attention.mdx#self-attention). :::note Embedding models Whereas the token embeddings described above use a vector space to represent the meaning of individual tokens, the output from an embedding model uses a vector space to represent the meaning of the input data (a document) as a whole. 
In this way, an embedding model allows you to programmatically search and compare different documents by analyzing their corresponding embeddings, which can reveal nuanced meaning and semantics far beyond what a pure text comparison can achieve. ::: --- ## Flash attention Flash attention is an optimization technique to compute attention blocks in [transformer](transformer.mdx) models. Traditional [attention](attention.mdx) requires storing large intermediate activation tensors, leading to high memory overhead that slows execution because it requires frequent memory transfers between high-bandwidth memory (HBM) and faster SRAM on the GPU. Flash attention improves performance and reduces the memory footprint for attention layers. It reorders computations with techniques such as tiling to compute attention scores in blocks, and it keeps only small chunks of activations in the faster on-chip SRAM. This allows the model to process much longer sequences without running into memory limitations. By improving the efficiency of attention layers, flash attention enables LLMs to handle much longer contexts, improving their ability to understand and generate complex text. It's particularly beneficial for: - Large language models with long context windows - Vision transformers processing high-resolution images - Multi-modal models with large attention matrices - Fine-tuning large models on limited GPU memory ## Implementation details Flash attention optimizes the classic [attention](attention.mdx) mechanism by: 1. **Tiling the computation**: Breaking the `Q`, `K`, and `V` matrices into smaller blocks that fit in GPU shared memory, which is much faster than global memory. 2. **Fusing operations**: Combining softmax normalization with matrix multiplication for each tile into a single kernel. These help maximize the locality and reduce DRAM (global memory) traffic.
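The two ideas above can be sketched on the CPU for a single query row. This is only an illustration of tiling combined with a running ("online") softmax, written in pure Python under the assumption that `K` and `V` are lists of row vectors. The function names and `block_size` parameter are hypothetical, and the sketch says nothing about how a real GPU kernel lays out shared memory.

```python
import math


def naive_attention_row(q, K, V):
    """Reference version: materializes all scores for the query at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, V)) / total for j in range(len(V[0]))]


def tiled_attention_row(q, K, V, block_size=2):
    """Processes K and V one block at a time, never holding all scores."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf       # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * len(V[0])       # unnormalized output, relative to running_max
    for start in range(0, len(K), block_size):
        k_block = K[start:start + block_size]
        v_block = V[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [
            a * correction + sum(e * v[j] for e, v in zip(exps, v_block))
            for j, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
K = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

reference = naive_attention_row(q, K, V)
tiled = tiled_attention_row(q, K, V, block_size=2)
print(reference)
print(tiled)
```

Both functions produce the same result up to floating-point rounding. The tiled version only ever holds `block_size` scores at a time, which is the property that lets a fused kernel keep its working set in fast on-chip memory.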
To see an implementation of [FlashAttention-2](https://arxiv.org/abs/2307.08691) as a fused operation, see [`fused_attention.mojo` on GitHub](https://github.com/modular/modular/blob/main/max/examples/custom_ops/kernels/fused_attention.mojo). --- ## Inference routing Inference routing is the process of directing incoming inference requests to the appropriate worker node in a distributed LLM serving cluster. Rather than simply forwarding requests to the next available worker, an inference router uses configurable routing strategies to intelligently distribute traffic based on workload characteristics, hardware state, and caching conditions. The inference router receives a prompt from an HTTP server, analyzes the request to extract information relevant to the selected routing strategy, selects a worker based on the routing algorithm and current cluster state, proxies the request to that worker, and streams the response back to the user.
An overview of the steps taken by an inference router to select a worker and proxy the response.
## Routing strategies

| Name           | Strategy                                                             | Use case                                                     |
|----------------|----------------------------------------------------------------------|--------------------------------------------------------------|
| KV cache-aware | Routes based on shared tokens or document chunks in the KV cache     | Repeated prompts in chatbots, agents, or RAG-style workflows |
| Least request  | Sends requests to the worker with the fewest active requests         | Mixed workloads with variable size or latency requirements   |
| Prefix-aware   | Uses consistent hashing on prompt prefixes to group similar requests | Prompts with shared templates or recurring task descriptions |
| Random         | Selects a backend worker at random                                   | Benchmarking and exposing latency variability                |
| Round robin    | Distributes requests evenly across all workers in sequential order   | Stateless, uniform tasks without caching needs               |
| Sticky session | Routes requests with the same session ID to the same worker          | Session-based chat or apps needing memory and continuity     |

### KV cache-aware

KV cache-aware routing manages requests based on the contents of the [KV cache](kv-cache.mdx) on each worker. It is most useful for retrieval-augmented generation (RAG) systems where many queries share common document chunks or similar inputs, but not identical prefixes. KV cache-aware routing is especially useful for high-throughput workloads with many repeating or similar tokens across queries.

### Least request

Least request routing sends new inference requests to the worker currently handling the fewest active requests. This helps balance load dynamically and reduces the chance of overloading any single worker. It is especially useful for variable-length or unpredictable inference tasks and workloads where you want to minimize tail latency.

### Prefix-aware

Prefix-aware routing (also known as consistent hashing) examines the prompt prefix in an incoming request and routes it to the worker handling requests with the same prefix.
This maximizes prefix cache reuse: for example, if many users share a common system prompt, that prefix stays cached on a single node. When a worker becomes saturated for a popular prefix, the router automatically distributes the load by spilling over to additional workers, maintaining partial cache locality while balancing traffic. Prefix-aware routing is especially useful when many users send queries that start with the same instructions or template, or in multi-turn conversations where session stickiness isn't enabled. ### Random Random routing selects a backend worker at random from the pool of available endpoints for each incoming request. It is most useful for benchmarking: by eliminating routing bias, it exposes average worker performance under distributed load and helps identify latency variability across nodes. ### Round robin Round robin routing distributes incoming requests evenly across all available workers in sequential order, cycling back to the first worker after reaching the last. It is well-suited for stateless or homogeneous workloads where each request is independent and caching is not a concern. ### Sticky session Sticky session routing sends a user's requests to the same worker node for the duration of their session, identified by a session ID in the HTTP request header. If no session header is present, the router falls back to round robin. This strategy is most useful for chatbots or streaming applications where in-flight session state is maintained on the server and continuity across requests matters. ## Relation to KV cache and prefix caching Several routing strategies, particularly prefix-aware and KV cache-aware routing, are designed to maximize the value of the [KV cache](kv-cache.mdx). By routing requests with shared prompt prefixes to the same worker, these strategies reduce redundant computation and improve throughput. See [prefix caching](/max/serve/prefix-caching) for more on how caching works at the serving layer. 
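To make a few of the strategies above concrete, here is a small pure-Python sketch. The `Router` class, its method names, and the worker names are hypothetical. The prefix-aware method uses a plain hash-modulo scheme as a stand-in for real consistent hashing, and it omits the spill-over behavior described above. Assigning a new session by round robin is also an assumption.

```python
import hashlib
import itertools


class Router:
    """Toy inference router; not a model of any real serving system."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.active = {w: 0 for w in self.workers}  # in-flight requests per worker
        self._cycle = itertools.cycle(self.workers)
        self.sessions = {}                          # session ID -> worker

    def round_robin(self):
        """Cycle through the workers in order."""
        return next(self._cycle)

    def least_request(self):
        """Pick the worker with the fewest active requests (first wins ties)."""
        return min(self.workers, key=lambda w: self.active[w])

    def prefix_aware(self, prompt, prefix_chars=16):
        """Hash the prompt prefix so equal prefixes map to the same worker."""
        digest = hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).digest()
        return self.workers[int.from_bytes(digest[:8], "big") % len(self.workers)]

    def sticky_session(self, session_id=None):
        """Reuse the worker bound to this session; without an ID, use round robin."""
        if session_id is None:
            return self.round_robin()
        if session_id not in self.sessions:
            self.sessions[session_id] = self.round_robin()
        return self.sessions[session_id]


router = Router(["worker-0", "worker-1", "worker-2"])
router.active.update({"worker-0": 4, "worker-1": 1, "worker-2": 2})

print(router.least_request())
print(router.prefix_aware("You are a helpful assistant. Summarize this text."))
print(router.sticky_session("session-42"), router.sticky_session("session-42"))
```

Two prompts that share their first `prefix_chars` characters always land on the same worker in this sketch, which is the property that lets a shared system prompt stay cached in one place.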
---

## KV cache

KV (key-value) cache is a memory structure used in [transformer](transformer.mdx) models to store key-value tensors output from [self-attention](attention.mdx#self-attention) layers. The KV cache speeds up inference for transformer models such as large language models (LLMs) by avoiding the need to recompute the keys and values for all previous tokens in a sequence.

For example, suppose an LLM is trying to complete the sentence, "The quick brown fox..." After the model predicts "jumps" and then begins to predict the next token, the model must compute attention over every token in the sequence so far (including the one it just predicted). That is, for each step in the [autoregression](autoregression.mdx) cycle, it must process the entire sequence thus far:

1. "The quick brown fox..."
2. "The quick brown fox jumps..."
3. "The quick brown fox jumps over..."

And so on.

By storing the already-calculated keys and values for previous tokens in the KV cache, the model simply reads them at each step instead of recomputing them. Once the model predicts the next token and computes its key and value tensors, it appends them to the KV cache.

As the sequence length grows during inference (as more words are generated), the KV cache becomes the dominant factor in a model's memory usage. The sequence length is always limited by the model's total context window length, which varies across models and can usually be configured.

---

## Padding tokens

Padding tokens are extra tokens (usually zeros or special tokens) that are added to the input for a model so that the input matches the model's fixed input length or to ensure that all sequences in a [batch](batching.mdx) have the same length.

In [transformer](transformer.mdx) models, padding tokens have been mostly replaced with [ragged tensors](ragged-tensors.mdx).
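As a small sketch of the difference, the following pure-Python example pads a batch to a common length and builds the matching attention mask, then packs the same batch without padding. The pad ID of `0`, the function names, and the values-plus-offsets layout for the packed form are assumptions chosen for illustration. Real frameworks may represent ragged data differently.

```python
PAD_ID = 0  # hypothetical ID reserved for the padding token


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length and return (padded, mask).

    The mask holds 1 for real tokens and 0 for padding tokens.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def pack_ragged(sequences):
    """Concatenate all tokens and record where each sequence starts and ends."""
    values = [token for seq in sequences for token in seq]
    offsets = [0]
    for seq in sequences:
        offsets.append(offsets[-1] + len(seq))
    return values, offsets


batch = [[5, 6, 7], [8], [9, 10]]

padded, mask = pad_batch(batch)
values, offsets = pack_ragged(batch)

print(padded)   # 3 rows x 3 columns, with padding filling the short rows
print(mask)
print(values)   # 6 real tokens, no padding
print(offsets)
```

In this example the padded batch stores nine slots for six real tokens, while the packed form stores only the six tokens plus four offsets.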
--- ## PagedAttention PagedAttention is a memory management technique designed to improve GPU memory utilization during large language model (LLM) serving. Inspired by classical virtual memory and paging methods used in operating systems, PagedAttention divides the [KV cache](kv-cache.mdx) into fixed-size blocks, which are not necessarily stored contiguously in memory. This approach enables more efficient handling of dynamic states in LLMs, allowing the model to manage large context sizes while optimizing memory usage, as described in the 2023 paper [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180) (Kwon, et al., 2023). Also written as "paged attention." --- ## Ragged tensors Ragged tensors is a method for batching multiple requests with differing sequence lengths without the need for [padding tokens](padding-tokens.mdx). Ragged tensors allow sequences of variable lengths to be processed together efficiently by storing them in a compact, non-uniform format. Also sometimes referred to as "packed tensors." --- ## Tokenization Tokenization is the process of dividing the input for an AI model into discrete units that have numerical IDs called tokens. Depending on what the input is (such as text, audio, or an image) the tokens might be based on different words or subwords in text, or different slices/blocks of pixels in images. For example, consider the sentence, "The cat sat on the mat." A word-level tokenization might split this sentence into the following words: "The," "cat," "sat," "on," "the," "mat." Then it replaces each word with a token (a number). The token "vocabulary"—the mapping of words to numbers—is predetermined and may vary from model to model. But tokenizers in large language models (LLMs) are much more sophisticated than that. 
Among other things, they also tokenize punctuation (or combinations of words and punctuation) and break words into subwords that allow them to tokenize words they've never seen before.

Because LLMs are trained on these tokens, they don't actually understand words and letters the way we do. They can only recognize and generate information based on the token vocabulary that they were trained on. (Popular LLMs have a token vocabulary with over 100,000 tokens.)

---

## Transformer

A transformer is a neural network architecture designed to perform complex tasks with sequential data (such as text, speech, and images) in a manner that can be efficiently parallelized on GPUs or other accelerator hardware. This makes them highly effective for natural language processing and other generative AI (GenAI) applications.

The transformer model architecture was first introduced in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani, et al., 2017). This design emphasizes the use of [self-attention](attention.mdx#self-attention) mechanisms instead of recurrent structures like recurrent neural networks (RNNs) or long short-term memory networks (LSTMs), which is what allows for the processing to be parallelized across separate compute cores instead of requiring the model to process the sequence one step at a time. This design is currently the foundation for all major large language models (LLMs) such as GPT, Llama, Gemini, DeepSeek, and more.

---

## Block index

In GPU programming, a block index uniquely identifies a subset of [threads](thread.mdx) that execute a [kernel](kernel.mdx) function on the GPU. Threads are grouped into units called [blocks](thread-block.mdx), and multiple blocks together form a larger structure known as a [grid](grid.mdx).

Each block within the grid is assigned a unique block index, which can be represented across one, two, or three dimensions.
This allows for flexible organization of threads to match the structure of the problem being solved. Within each block, individual threads have their own [thread index](thread-index.mdx), which, together with the block index, determines which part of the problem each thread should work on. This hierarchical structure of grids, blocks, and threads enables efficient workload distribution across the many processing cores of the GPU, maximizing parallel performance. Because a programmer can arrange thread blocks within a grid across one, two, or three dimensions, a block index is a 3-element vector of x, y, and z coordinates. For 2-dimensional arrangements, the z coordinate of all block indices is 0, and for 1-dimensional arrangements, both the y and z coordinates of all block indices are 0. --- ## Grid A grid is the top-level organizational structure of the threads executing a [kernel](kernel.mdx) function on a GPU. A grid consists of multiple [thread blocks](thread-block.mdx) (also known as *workgroups* on AMD GPUs), which are further divided into individual [threads](thread.mdx) (or *work units* on AMD GPUs) that execute the kernel function concurrently. The division of a grid into thread blocks serves multiple crucial purposes: - First, it breaks down the overall workload—managed by the grid—into smaller, more manageable portions that can be processed independently. This division allows for better resource utilization and scheduling flexibility across multiple [streaming multiprocessors](streaming-multiprocessor.mdx) (SMs) in the GPU (or *compute units* on AMD GPUs). - Second, thread blocks provide a scope for threads to collaborate through shared memory and synchronization primitives, enabling efficient parallel algorithms and data sharing patterns. - Finally, thread blocks help with scalability by allowing the same program to run efficiently across different GPU architectures, as the hardware can automatically distribute blocks based on available resources. 
The programmer specifies the number of thread blocks in a grid and how they are arranged across one, two, or three dimensions. Typically, the programmer determines the dimensions of the grid based on the dimensionality of the data to process. For example, a programmer might choose a 1-dimensional grid for processing large vectors, a 2-dimensional grid for processing matrices, and a 3-dimensional grid for processing the frames of a video. Each block within the grid is assigned a unique [block index](block-index.mdx) that determines its position within the grid.

Similarly, the programmer also specifies the number of threads per thread block and how they are arranged across one, two, or three dimensions. Each thread within a block is assigned a unique [thread index](thread-index.mdx) that determines its position within the block. The combination of block index and thread index uniquely identifies the position of a thread within the overall grid.

---

## Kernel

A kernel is a function that runs on a GPU, executing computations in parallel across a large number of [threads](thread.mdx). Kernels are a fundamental part of general-purpose GPU (GPGPU) programming and are designed to process large datasets efficiently by performing the same operation simultaneously on multiple data elements.

---

## GPU memory

GPU memory consists of both on-chip memory and external dynamic random-access memory (DRAM), often referred to as *device memory* (in contrast to the *host memory* used by the CPU).
On-chip memory includes:

- A register file for each [streaming multiprocessor](streaming-multiprocessor.mdx) (SM), containing the [registers](register.mdx) used by threads executing on the SM's cores
- An L1 cache for each SM to cache reads from global memory
- Shared memory for each SM, containing data explicitly shared between the threads of a given [thread block](thread-block.mdx) executing on the SM
- A read-only constant cache for each SM, which caches data read from the constant memory space in global memory
- An L2 cache shared by all SMs that is used to cache accesses to local or global memory, including temporary register spills

Device memory includes:

- Global memory, which contains data accessible to all threads
- Constant memory, which contains data explicitly identified as read-only by the programmer, and which is accessible to all threads
- Local memory, which contains data private to an individual thread, such as statically allocated arrays, spilled registers, and other elements of the thread's call stack

Data in global memory persists until explicitly freed, even across [kernel](kernel.mdx) functions. This means that one kernel can write data to global memory and then a subsequent kernel can read that data.

---

## Occupancy

In GPU programming, occupancy is a measure of the efficiency of the GPU's compute resources. It is defined as the ratio of the number of active [warps](warp.mdx) to the maximum number of warps that can be active on a given [streaming multiprocessor](streaming-multiprocessor.mdx) (SM) at any one time.

Higher occupancy can improve parallel execution and hide memory latency, but increasing occupancy does not always boost performance, as factors like memory bandwidth and instruction dependencies may create bottlenecks. The optimal occupancy level depends on the workload and GPU architecture.

---

## Register

A GPU register is the fastest form of storage within a [streaming multiprocessor](streaming-multiprocessor.mdx) (SM).
Registers store integer and floating-point values used frequently by a [thread](thread.mdx), reducing reliance on slower [memory](memory.mdx) types (shared, global, or local memory).

Registers are located within an SM in what is referred to as a *register file*. The number of registers depends on the GPU architecture, but modern GPUs support thousands of registers per SM. For each thread that it executes, the SM allocates a set of registers for the private use of that thread. The registers are associated with that thread throughout its lifetime, even if the thread is not currently executing on the SM's cores (for example, if it is blocked waiting for data from memory). A thread can't access registers assigned to a different thread, preventing data conflicts between threads.

If the execution of a [kernel](kernel.mdx) function by a thread requires more registers than are available, the compiler arranges to spill some register data to the thread's local [memory](memory.mdx). Because local memory access is slower than register access, programmers should try to design their kernels to avoid or limit the amount of spill.

---

## Streaming multiprocessor

The basic building block of a GPU is called a *streaming multiprocessor* (SM) on NVIDIA GPUs or a *compute unit* (CU) on AMD GPUs (they're the same idea and we'll refer to them both as SM). SMs sit between the high-level GPU control logic and the individual execution units, acting as self-contained processing factories that can operate independently and in parallel.

Multiple SMs are arranged on a single GPU chip, with each SM capable of handling multiple workloads simultaneously. The GPU's global scheduler assigns work to individual SMs, while the memory controller manages data flow between the SMs and the various levels of the [memory](memory.mdx) hierarchy (global memory, L2 cache, etc.).
The number of SMs in a GPU can vary significantly based on the model and intended use case, from a handful in entry-level GPUs to dozens or even hundreds in high-end professional cards. This scalable architecture enables GPUs to maintain excellent performance across different workload sizes and types.

Each SM contains several essential components:

- **CUDA Cores (NVIDIA)/Stream Processors (AMD):** These are the basic arithmetic logic units (ALUs) that perform integer and floating-point calculations. A single SM can contain dozens or hundreds of these cores.
- **Tensor Cores (NVIDIA)/Matrix Cores (AMD):** Specialized units optimized for matrix multiplication and convolution operations.
- **Special Function Units (SFUs):** Handle complex mathematical operations like trigonometry, square roots, and exponential functions.
- **[Register](register.mdx) Files:** Ultra-fast storage that holds intermediate results and thread-specific data. Modern SMs can have hundreds of kilobytes of register space shared among active [threads](thread.mdx).
- **Shared Memory/L1 Cache:** A programmable, low-latency memory space that enables data sharing between threads. This memory is typically configurable between shared memory and L1 cache functions.
- **Load/Store Units:** Manage data movement between different memory spaces, handling memory access requests from threads.

---

## Thread block

In GPU programming, a thread block (also known as *workgroup* on AMD GPUs) is a subset of threads within a [grid](grid.mdx), which is the top-level organizational structure of the [threads](thread.mdx) executing a [kernel](kernel.mdx) function.

As the primary building block for workload distribution, thread blocks serve multiple crucial purposes:

- First, they break down the overall workload of a kernel function (managed by the grid) into smaller, more manageable portions that can be processed independently.
  This division allows for better resource utilization and scheduling flexibility across multiple [streaming multiprocessors](streaming-multiprocessor.mdx) (SMs) in the GPU.
- Second, thread blocks provide a scope for threads to collaborate through shared memory and synchronization primitives, enabling efficient parallel algorithms and data sharing patterns.
- Finally, thread blocks help with scalability by allowing the same program to run efficiently across different GPU architectures, as the hardware can automatically distribute blocks based on available resources.

The programmer specifies the number of thread blocks in a grid and how they are arranged across one, two, or three dimensions. Each block within the grid is assigned a unique [block index](block-index.mdx) that determines its position within the grid. Similarly, the programmer also specifies the number of threads per thread block and how they are arranged across one, two, or three dimensions. Each thread within a block is assigned a unique [thread index](thread-index.mdx) that determines its position within the block.

The GPU assigns each thread block within the grid to a streaming multiprocessor (SM) for execution. The SM groups the threads within a block into fixed-size subsets called [warps](warp.mdx), consisting of either 32 or 64 threads each depending on the particular GPU architecture. The SM's warp scheduler manages the execution of warps on the SM's cores.

Threads within a block can share data through [shared memory](memory.mdx) and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks.

---

## Thread index

In GPU programming, a thread index uniquely identifies the position of a [thread](thread.mdx) within a particular [thread block](thread-block.mdx) executing a [kernel](kernel.mdx) function on the GPU.
A thread block is a subset of threads in a [grid](grid.mdx), which is the top-level organizational structure of the threads executing a kernel function. Each block within the grid is also assigned a unique block index, which identifies the block's position within the grid. The combination of block index and thread index uniquely identifies the thread's overall position within the grid, and is used to determine which part of the problem each thread should work on.

Because a programmer can arrange threads within a thread block across one, two, or three dimensions, a thread index is a 3-element vector of x, y, and z coordinates. For 2-dimensional arrangements, the z coordinate of all thread indices is 0, and for 1-dimensional arrangements, both the y and z coordinates of all thread indices are 0.

---

## Thread

In GPU programming, a thread (also known as a *work-item* on AMD GPUs) is the smallest unit of execution within a [kernel](kernel.mdx) function. Threads are grouped into [thread blocks](thread-block.mdx) (or *workgroups* on AMD GPUs), which are further organized into a [grid](grid.mdx).

The programmer specifies the number of thread blocks in a grid and how they are arranged across one, two, or three dimensions. Each block within the grid is assigned a unique [block index](block-index.mdx) that determines its position within the grid. Similarly, the programmer also specifies the number of threads per thread block and how they are arranged across one, two, or three dimensions. Each thread within a block is assigned a unique [thread index](thread-index.mdx) that determines its position within the block.

The GPU assigns each thread block within the grid to a [streaming multiprocessor](streaming-multiprocessor.mdx) (SM) for execution. The SM groups the threads within a block into fixed-size subsets called [warps](warp.mdx), consisting of either 32 or 64 threads each depending on the particular GPU architecture.
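As a rough illustration of how a block index and a thread index locate a thread, and of which warp that thread would fall into, here is a minimal Python sketch. It assumes the CUDA convention that the x coordinate varies fastest when a block's threads are linearized and that warps are formed from consecutive linearized threads; other platforms may differ. The function name `locate` and its parameters are hypothetical, chosen only for this example.

```python
def locate(block_idx, thread_idx, block_dim, warp_size=32):
    """Locate one thread from its block index and thread index.

    All three arguments are (x, y, z) tuples. `locate` is a hypothetical
    helper for illustration, not part of any GPU API.
    """
    bx, by, bz = block_idx
    tx, ty, tz = thread_idx
    dx, dy, dz = block_dim
    if not (0 <= tx < dx and 0 <= ty < dy and 0 <= tz < dz):
        raise ValueError("thread index must lie inside the block dimensions")

    # Position within the overall grid, one coordinate per dimension.
    global_xyz = (bx * dx + tx, by * dy + ty, bz * dz + tz)

    # Assumption (CUDA convention): x varies fastest when flattening.
    linear = tx + ty * dx + tz * dx * dy

    return {
        "global_xyz": global_xyz,
        "warp_in_block": linear // warp_size,  # which warp of the block
        "lane": linear % warp_size,            # slot within that warp
    }


# A 16x16 block (256 threads): thread (5, 3, 0) of block (2, 0, 0).
print(locate((2, 0, 0), (5, 3, 0), (16, 16, 1)))
# {'global_xyz': (37, 3, 0), 'warp_in_block': 1, 'lane': 21}
```

In a real kernel the same arithmetic is usually written inline against the built-in index variables that the programming model provides; the dictionary here only makes the intermediate values visible.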
The SM's warp scheduler manages the execution of warps on the SM's cores.

The SM allocates a set of [registers](register.mdx) for each thread to store and process values private to that thread. The registers are associated with that thread throughout its lifetime, even if the thread is not currently executing on the SM's cores (for example, if it is blocked waiting for data from memory). Each thread also has access to [local memory](memory.mdx) to store statically allocated arrays, spilled registers, and other elements of the thread's call stack.

Threads within a block can share data through shared memory and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks.

---

## Warp

In GPU programming, a warp (also known as a *wavefront* on AMD GPUs) is a subset of [threads](thread.mdx) from a [thread block](thread-block.mdx) that execute together in lockstep.

When a GPU assigns a thread block to execute on a [streaming multiprocessor](streaming-multiprocessor.mdx) (SM), the SM divides the thread block into warps of 32 or 64 threads, with the exact size depending on the GPU architecture. If a thread block contains a number of threads not evenly divisible by the warp size, the SM creates a partially filled final warp that still consumes a full warp's resources. For example, if a thread block has 100 threads and the warp size is 32, the SM creates:

- 3 full warps of 32 threads each (96 threads total)
- 1 partial warp with only 4 active threads but still occupying a full warp's worth of resources (32 thread slots)

The SM effectively disables the unused thread slots in partial warps, but these slots still consume hardware resources. For this reason, developers generally should make thread block sizes a multiple of the warp size to optimize resource usage.

Each thread in a warp executes the same instruction at the same time on different data, following the single instruction, multiple threads (SIMT) execution model.
If threads within a warp take different execution paths (called *warp divergence*), the warp serially executes each branch path taken, disabling threads that are not on that path. This means that optimal performance is achieved when all threads in a warp follow the same execution path.

An SM can actively manage multiple warps from different thread blocks simultaneously, helping keep execution units busy. For example, the warp scheduler can quickly switch to another ready warp if the current warp's threads must wait for memory access.

Warps deliver several key performance advantages:

- The hardware needs to manage only warps instead of individual threads, reducing scheduling overhead
- Threads in a warp can access contiguous memory locations efficiently through memory coalescing
- The hardware keeps threads within a warp executing in lockstep, reducing the need for explicit synchronization within the warp
- The warp scheduler can hide memory latency by switching between warps, maximizing compute resource utilization
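The partial-warp arithmetic and the cost of warp divergence described in this entry can be sketched in a few lines of Python. This is a simplified model, not a description of any particular GPU: the helper names (`split_into_warps`, `idle_slots`, `serialized_passes`) are hypothetical, and divergence is approximated as one serialized pass per distinct branch path taken within a warp.

```python
def split_into_warps(block_threads, warp_size=32):
    """Return the number of active threads in each warp of one thread block."""
    if block_threads <= 0 or warp_size <= 0:
        raise ValueError("block_threads and warp_size must be positive")
    num_warps = (block_threads + warp_size - 1) // warp_size  # ceiling division
    # Every warp is full except possibly the last one.
    return [min(warp_size, block_threads - w * warp_size) for w in range(num_warps)]


def idle_slots(block_threads, warp_size=32):
    """Thread slots reserved by the block's warps but left inactive."""
    return len(split_into_warps(block_threads, warp_size)) * warp_size - block_threads


def serialized_passes(path_per_thread, warp_size=32):
    """Simplified divergence model: one pass per distinct path within each warp."""
    return [
        len(set(path_per_thread[start:start + warp_size]))
        for start in range(0, len(path_per_thread), warp_size)
    ]


print(split_into_warps(100))  # [32, 32, 32, 4]
print(idle_slots(100))        # 28
print(idle_slots(128))        # 0

# Branching on thread parity splits every warp into two paths...
print(serialized_passes(["even" if t % 2 == 0 else "odd" for t in range(64)]))  # [2, 2]
# ...while branching on a warp-aligned boundary keeps each warp on one path.
print(serialized_passes(["a" if t < 32 else "b" for t in range(64)]))           # [1, 1]
```

The first three calls reproduce the 100-thread example: three full warps plus one warp with 4 active threads, leaving 28 slots idle, whereas a block size that is a multiple of the warp size leaves none. The last two calls show why branch conditions aligned to warp boundaries tend to avoid divergence under this model.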