# Glossary

> Glossary of AI, GPU, and systems programming terms.

This file contains all documentation content in a single document following the llmstxt.org standard.

## Attention mask

An attention mask specifies which tokens in a sequence a model can attend to
during [attention](attention.mdx) score computation. This prevents the model
from attending to tokens it should ignore. For example, when sequences in a
batch are padded to the same length, an attention mask prevents the model from
attending to [padding tokens](padding-tokens.mdx), which carry no meaningful
information.
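
As a minimal sketch, a padding mask for a batch could be built as follows. The helper name `padding_mask` and the padding ID `0` are assumptions made for illustration, not part of any particular library.

```python
PAD_ID = 0  # hypothetical padding token ID


def padding_mask(batch):
    """Return 1 for real tokens and 0 for padding tokens."""
    return [[0 if tok == PAD_ID else 1 for tok in seq] for seq in batch]


batch = [
    [5, 9, 2, PAD_ID, PAD_ID],  # short sequence padded to length 5
    [7, 3, 8, 4, 6],            # full-length sequence
]
mask = padding_mask(batch)
```

Positions marked `0` are the ones the model is prevented from attending to.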

## Causal mask

In [transformer](transformer.mdx) models, self-attention — a specific form of
attention where a sequence attends to itself — allows every token to attend to
all other tokens simultaneously, with no inherent notion of order.
[Autoregressive](autoregression.mdx) language models, however, must generate
tokens sequentially, meaning each token is conditioned only on preceding
tokens. The *causal mask* (also called a *look-ahead mask*) resolves this
tension by preventing the self-attention layer from attending to future tokens,
ensuring that each token's representation incorporates information only from
tokens at previous positions.

Concretely, the causal mask is a matrix that sets attention scores to negative
infinity for future positions. After the softmax operation, these
negative-infinity values become zero, blocking information flow from later
tokens to earlier ones.
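
The following sketch illustrates the idea with plain Python lists. The toy scores and helper names are assumptions chosen for illustration; real implementations apply the mask to tensors on the accelerator.

```python
import math


def causal_mask(size):
    """Row i holds 0.0 for positions <= i and -inf for future positions."""
    return [[0.0 if j <= i else -math.inf for j in range(size)]
            for i in range(size)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [[1.0, 2.0, 3.0]] * 3  # toy raw attention scores
mask = causal_mask(3)
weights = [
    softmax([s + m for s, m in zip(score_row, mask_row)])
    for score_row, mask_row in zip(scores, mask)
]
```

In `weights`, every entry above the diagonal is zero, so the first token attends only to itself while the last token attends to all three positions.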

The causal mask is essential during training, where the model processes entire
sequences in parallel and must be prevented from attending to tokens it should
be predicting. The same constraint applies during inference at the [context
encoding](context-encoding.mdx) (also called prefill) phase,
where all input tokens are likewise processed in parallel. Without the causal
mask, information from later tokens would corrupt the representations of
earlier tokens, producing attention scores that differ from what the model
learned during training.

Note that during the decode phase of inference, the causal mask is effectively
redundant: the model generates one token at a time and attends only to the [KV
cache](kv-cache.mdx) of previously-seen tokens, so there are no future tokens
to mask.

---

## Attention

Attention is a mechanism used in AI models such as
[transformers](transformer.mdx) that enables the model to assign different
levels of importance to different tokens (such as words or pixels) in an input
sequence. Unlike traditional architectures that treat all input data equally,
attention allows the model to capture relationships between tokens that may be
far apart in a sequence. This enables large language models (LLMs) to generate
coherent, contextually relevant output.

Attention was introduced and refined in the papers [Neural Machine Translation
by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
(Bahdanau et al., 2014) and [Effective Approaches to Attention-based Neural
Machine Translation](https://arxiv.org/abs/1508.04025) (Luong et al., 2015).

## How attention works

Attention operates on three vectors: a **query** (Q), a **key** (K), and a
**value** (V). The query comes from the token (or sequence) that is looking for
information, while the keys and values come from the tokens being looked at.
These two sources can be different: for example, in machine translation, a
decoder token might query the keys and values of an encoder's output to decide
which input words are most relevant. This is sometimes called cross-attention.

Regardless of where the queries, keys, and values originate, the attention
operation follows the same steps: it compares each query against every key to
produce a matrix of raw attention scores, normalizes the scores (via softmax)
into a probability distribution, and uses those probabilities to compute a
weighted combination of the value vectors. The result is a new
[embedding](embedding.mdx) for each query token that encodes the information it
gathered from the tokens it attended to.

## Self-attention

The most well-known form of attention is self-attention, used in
[transformer](transformer.mdx) models. In self-attention, the queries, keys, and
values all come from the same sequence, which means every token attends to every
other token **in its own input**. This allows the model to build a rich
understanding of context by evaluating how each token relates to all others,
regardless of their distance in the sequence.

Because self-attention recomputes scores for every token in the sequence, doing
so from scratch at each generation step would be expensive. To avoid this, the
model saves the calculated keys and values into the [KV cache](kv-cache.mdx)
so they can be reused during the next [autoregression](autoregression.mdx)
cycle.

## Scaled dot-product attention

Scaled dot-product attention is the standard implementation of the attention
operation used in transformer models.

The Q, K, and V matrices each have shape `[batchSize, numHeads, S, d]`, where:

- `S` is the sequence length (typically on the order of `10^3` to `10^4` tokens)

- `d` is the size per attention head in multi-head attention (usually a power
of 2 like 64 or 128, and smaller than `S`).

These matrices go through the following operations:

1. `Q x Transpose(K)`: Batched matrix multiplication (`bmm`) that produces a
matrix of raw attention scores, one for every pair of tokens. The scores are
scaled by `1/sqrt(d)`, which is what makes this *scaled* dot-product attention.

2. `softmax`: Conversion of the scaled scores into a probability distribution so
they sum to 1 for each token.

3. `softmax(Q x Transpose(K)) x V`: Another `bmm` that uses the normalized
scores to blend every token's value vector into a single output embedding per
token.

A limitation of this implementation is that it materializes an intermediate
matrix of shape `[batchSize, numHeads, S, S]`, introducing `O(S^2)` memory
allocation and traffic.
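
The three operations can be sketched for a single head in plain Python. This is an illustration only: the function names and toy matrices are assumptions, the batch and head dimensions are omitted, and the scores are divided by `sqrt(d)`, the scaling that gives scaled dot-product attention its name.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d = len(Q[0])
    # Step 1: Q x Transpose(K), scaled by 1/sqrt(d). This is the S x S matrix.
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d) for k_row in K]
        for q_row in Q
    ]
    # Step 2: normalize each row of scores so it sums to 1.
    weights = [softmax(row) for row in scores]
    # Step 3: blend the value vectors using the normalized scores.
    return [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Note that `scores` and `weights` each hold `S x S` entries, which is the intermediate matrix responsible for the `O(S^2)` cost.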

---

## Autoregression

Autoregression is a process by which an AI model iteratively predicts future
values based on previous values in a sequence, using its own output as input to
itself. Because each prediction depends on prior context, the process is
sequential, which limits parallelization.
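
The loop below is a minimal sketch of this feedback cycle. The function `toy_next_token` is a hypothetical stand-in for a model, not a real predictor.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a model: sum of the last two tokens, mod 10."""
    return (tokens[-1] + tokens[-2]) % 10


def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        # Each prediction is appended and becomes input to the next step.
        tokens.append(toy_next_token(tokens))
    return tokens


result = generate([1, 1], steps=5)
```

Each iteration depends on the output of the previous one, which is why the steps cannot run in parallel.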

Autoregression is a standard procedure in [transformer](transformer.mdx) models
such as large language models (LLMs) and other models that perform time-series
forecasting. This autoregressive process explains why AI chatbots like ChatGPT
and Gemini stream their output one token at a time, although they sometimes run
so fast that they appear to produce several words at once.

---

## Batching

Batching is the process of combining multiple inference requests into a single
forward pass through the model, thus executing multiple requests simultaneously
and improving computational efficiency. To account for requests with varying
sequence lengths, it's common to add techniques such as
[padding](padding-tokens.mdx) (to standardize lengths) or [ragged
tensors](ragged-tensors.mdx) (to handle variable lengths directly).

Batch sizes can be either static or dynamic. Whereas static batching uses a
fixed batch size and thus waits until the system receives a specific number of
inference requests before sending them into the model, dynamic batching uses a
flexible batch size. For example, dynamic batching may send a batch of requests
to the model as soon as the batch either reaches a certain number of requests
(batch size limit) or it reaches a timeout threshold.
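
The dispatch rule can be sketched with simulated arrival times. The function name and parameters are assumptions for illustration, and a real server would fire the timeout from a timer rather than waiting for the next arrival as this simplification does.

```python
def dynamic_batches(arrivals, max_batch_size, timeout):
    """Group (request_id, arrival_time) pairs, sorted by time, into batches.

    A batch is dispatched when it is full, or when a new request arrives
    after the batch's oldest request has waited at least `timeout` seconds.
    """
    batches, current = [], []
    for req_id, t in arrivals:
        if current and t - current[0][1] >= timeout:
            batches.append([r for r, _ in current])  # timeout reached
            current = []
        current.append((req_id, t))
        if len(current) == max_batch_size:
            batches.append([r for r, _ in current])  # batch size limit reached
            current = []
    if current:
        batches.append([r for r, _ in current])
    return batches


arrivals = [("a", 0.0), ("b", 0.01), ("c", 0.5), ("d", 0.51), ("e", 0.52)]
batches = dynamic_batches(arrivals, max_batch_size=3, timeout=0.1)
```

Here the first batch is sent because of the timeout and the second because it reached the size limit.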

Dynamic batching can get a lot more complicated than that with additional
tricks that keep GPUs busy instead of waiting for one batch to finish before
starting another. One such strategy for large language models (LLMs) is
[continuous batching](continuous-batching.mdx).

---

## Context encoding

Context encoding is the first phase of inference in a [transformer
model](transformer.mdx) (also known as the "prefill" stage). During context
encoding, the model processes the [tokenized](tokenization.mdx) input sequence
in parallel, computing [attention](attention.mdx) scores for every token. As a
byproduct of this computation, the model populates the [KV cache](kv-cache.mdx)
with the key and value vectors for each input token, so they don't need to be
recomputed during subsequent token generation.

After context encoding, the model enters the
[autoregressive](autoregression.mdx) decode phase, generating one token at a
time. Each new token only needs to compute attention against the existing KV
cache rather than reprocessing the entire input, which is what makes generation
after the first token comparatively fast.

Context encoding is typically the most computationally expensive phase because
it must process every input token at once. Although this work can be
parallelized across thousands of GPU threads, it is still the primary
contributor to time-to-first-token (TTFT) latency.

---

## Continuous batching

Continuous batching is a [batching](batching.mdx) technique that can
continuously dispatch inference requests to the GPU for [token
generation](token-generation.mdx) and dramatically improve GPU utilization.
Continuous batching can start executing a new batch even before the previous
batch finishes its pass through the model, because this batching technique
schedules new processing at the "token level."

That is, because large language models (LLMs) generate responses one token at a
time, there is a repeated cycle during inference (the token generation phase)
in which a new batch can jump in to utilize the GPU, even before a previous
batch finishes its pass through the model. That's what it means to operate at
the "token level"—the batch scheduler focuses on keeping the GPU busy with
token generation at all times, instead of waiting for the previous batch to
finish its complete forward pass.

This is sometimes called "in-flight batching" in cases where context
encoding and token generation requests are combined into the same batch.
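
A small simulation can make token-level scheduling concrete. This is a sketch under simplifying assumptions: every request needs at least one token, each step produces exactly one token per active request, and the names `continuous_batching` and `max_active` are hypothetical.

```python
from collections import deque


def continuous_batching(requests, max_active):
    """Simulate token-level scheduling.

    `requests` maps a request ID to the number of tokens it must generate.
    Returns the request IDs that were active at each generation step.
    """
    waiting = deque(requests.items())
    active = {}
    steps = []
    while waiting or active:
        # Fill free slots before every step instead of waiting for the
        # whole batch to finish.
        while waiting and len(active) < max_active:
            req_id, remaining = waiting.popleft()
            active[req_id] = remaining
        steps.append(sorted(active))
        # Generate one token per active request and retire finished ones.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] == 0:
                del active[req_id]
    return steps


steps = continuous_batching({"a": 1, "b": 3, "c": 2}, max_active=2)
```

Request `c` joins at the second step, as soon as `a` finishes, even though `b` is still generating.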

---

## Disaggregated inference

Disaggregated inference is a serving architecture pattern for large language
models (LLMs) in which the two main phases of inference, prefill and decode,
are executed on separate hardware resources. You might also see this technique
called disaggregated prefill or disaggregated serving. All of these names
describe the same core idea: separating the model's inference phases and
providing each phase with dedicated resources optimized for its specific
computational characteristics.

## Prefill and decode phases

LLM inference involves two distinct phases, each with different performance
characteristics.

**Prefill** (also known as context encoding) is the initial phase where the
model processes the entire input prompt. The model performs a full forward pass
to initialize its [KV cache](kv-cache.mdx) and predict the first output
token. This phase is compute-intensive, especially for long prompts, because it
involves large-scale matrix operations that demand high floating-point
throughput. The key performance metric for this phase is Time-to-First-Token
(TTFT): the duration from receiving the input prompt to producing the first
output token.

**Decode** (also known as token generation) is the phase where the model
generates output tokens one at a time, using the KV cache initialized during
prefill. By leveraging this cache, the model avoids reprocessing the full input
each time. The decoding phase is less compute-intensive per token but becomes
memory-bound, relying heavily on efficient access to cached data. The key
performance metric here is Inter-Token Latency (ITL): the time taken to
generate each subsequent token after the first.
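
Both metrics can be derived from timestamps, as in the sketch below. The helper name and the sample timestamps are assumptions for illustration, and ITL is reported here as a mean over the gaps between tokens.

```python
def latency_metrics(request_time, token_times):
    """Compute TTFT and mean ITL (in seconds) from per-token timestamps."""
    ttft = token_times[0] - request_time
    gaps = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


# Request received at t=10.0 s; four tokens produced afterwards.
ttft, mean_itl = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25])
```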

## How disaggregated inference works

<figure>
  
  
  <figcaption>A simplified illustration of the separate prefill and decode
  nodes used in a disaggregated inference serving architecture.</figcaption>
</figure>

In a disaggregated setup, prefill and decode workloads are routed to different
GPUs or GPU nodes. This allows each phase to be optimized independently:

- **Prefill nodes** are configured with hardware that prioritizes high compute
  throughput, suited for the intensive matrix operations required to process
  long input prompts.
- **Decode nodes** are configured with hardware that prioritizes fast memory
  access, better suited for the sequential, cache-dependent nature of token
  generation.

This separation reduces contention between compute-bound and memory-bound
tasks, improves GPU utilization, and allows prefill and decode capacity to be
scaled independently.

## When to use disaggregated inference

Disaggregated inference is most valuable when minimizing latency is a priority.
Because the prefill stage is compute-intensive and the decode stage is
memory-bound, isolating the two stages and allocating them to different
hardware reduces resource contention and helps achieve both faster TTFT and
smoother token streaming.

It is especially effective for improving tail latency (such as P95, the latency
within which 95% of requests complete).
Disaggregation also enables more granular parallelism strategies: you can scale
prefill and decode nodes independently as demand changes, improving GPU
utilization and overall efficiency without over-provisioning capacity just to
handle peak workloads.
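
For reference, a tail-latency percentile such as P95 can be computed from a list of request latencies. The sketch below uses the nearest-rank method, which is one of several common percentile definitions, and the sample data is made up.

```python
import math


def percentile(latencies, pct):
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(latencies)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = list(range(1, 101))  # toy data: 1 ms through 100 ms
p95 = percentile(latencies_ms, 95)
```

With this data, 95 of the 100 requests finish at or below `p95`, and only the slowest 5 exceed it.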

Disaggregated inference is also well-suited to heterogeneous or
resource-constrained environments where you need to match each phase with
hardware that fits its specific demands.

---

## Embedding

An embedding (also known as a "vector embedding") is a numerical representation
of information in a high-dimensional vector space. For example, a token
embedding (or word embedding) encodes the meaning of words for use in large
language models (LLMs).

Because artificial neural networks (AI models) are a sequence of mathematical
operations, they require numerical structures as input. Vector embeddings are
numerical structures that provide a way to express a wide range of complex
concepts. They can be used to capture information about all sorts of things,
including words, groups of words, sounds, images, and more.

For example, [tokenizing](tokenization.mdx) a word like "bank" into a simple
number can't encode the different meanings in "bank loan" and "river bank." By
converting the token into a high-dimensional vector, we can encode (or "embed")
a variety of word meanings that help the model understand word relationships
using a notion of closeness along various vector dimensions (expressed through
[Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance)). In
this way, when a model encounters the embedding for the word "bank," it can
recognize the relationship it has with nearby words such as "loan" or "river,"
based on the closeness they each have to each other on different vector
dimensions (perhaps a "finance" dimension vs a "geography" dimension that were
learned during training).
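
To make the notion of closeness concrete, here is a sketch with hand-written 2-dimensional vectors. The values and the two named dimensions are invented for illustration; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical embeddings with dimensions (finance, geography).
embeddings = {
    "bank": (0.6, 0.5),
    "loan": (0.9, 0.1),
    "river": (0.1, 0.9),
    "banana": (-0.8, -0.7),
}


def euclidean(a, b):
    return math.dist(a, b)


near = euclidean(embeddings["bank"], embeddings["loan"])
far = euclidean(embeddings["bank"], embeddings["banana"])
```

In this toy space, "bank" sits closer to both "loan" and "river" than to the unrelated word "banana."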

Although word embeddings are a type of static embedding that encode the meaning
of individual words as input to an LLM, an LLM also builds its own embeddings
that are hidden inside the model. For example, as an LLM tries to understand
the relationship between each word from an input sequence, it compresses more
information into each token embedding based on the attention scores computed in
the [self-attention layer](attention.mdx#self-attention).

:::note Embedding models

Whereas the token embeddings described above use a vector space to represent
the meaning of individual tokens, the output from an embedding model uses a
vector space to represent the meaning of the input data (a document) as a
whole. In this way, an embedding model allows you to programmatically search
and compare different documents by analyzing their corresponding embeddings,
which can reveal nuanced meaning and semantics far beyond what a pure text
comparison can achieve.

:::

---

## Flash attention

Flash attention is an optimization technique to compute attention blocks in
[transformer](transformer.mdx) models. Traditional [attention](attention.mdx)
requires storing large intermediate activation tensors, leading to high memory
overhead that slows execution because it requires frequent memory transfers
between high-bandwidth memory (HBM) and faster SRAM on the GPU.

Flash attention improves performance and reduces the memory footprint for
attention layers. It reorders computations with techniques such as tiling to
compute attention scores in blocks, and it keeps only small chunks of
activations in the faster on-chip SRAM. This allows the model to process much
longer sequences without running into memory limitations.

By improving the efficiency of attention layers, flash attention enables LLMs
to handle much longer contexts, improving their ability to understand and
generate complex text. It's particularly beneficial for:

- Large language models with long context windows
- Vision transformers processing high-resolution images
- Multi-modal models with large attention matrices
- Fine-tuning large models on limited GPU memory

## Implementation details

Flash attention optimizes the classic [attention](attention.mdx) mechanism by:

1. **Tiling the computation**: Breaking the `Q`, `K`, and `V` matrices into
   smaller blocks that fit in GPU shared memory, which is much faster than
   global memory.
2. **Fusing operations**: Combining softmax normalization with matrix
   multiplication for each tile into a single kernel.

These techniques improve data locality and reduce DRAM (global memory) traffic.
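
The core trick behind tiling is that softmax can be computed incrementally by tracking a running maximum and running sum. The sketch below shows this for a single query vector in plain Python and checks it against the untiled result. It is an illustration of the idea, not a GPU kernel: the function names are assumptions, and the `1/sqrt(d)` scaling is omitted for brevity.

```python
import math


def attention_row(q, K, V):
    """Reference: softmax over all keys at once for one query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, V)) / total
            for c in range(len(V[0]))]


def tiled_attention_row(q, K, V, tile):
    """Process K and V in tiles, keeping only a running max, sum, and output."""
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), tile):
        k_tile, v_tile = K[start:start + tile], V[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(exps)
        acc = [a * scale + sum(e * v[c] for e, v in zip(exps, v_tile))
               for c, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
full = attention_row(q, K, V)
tiled = tiled_attention_row(q, K, V, tile=2)
```

The tiled version never holds the full row of scores, only one tile at a time, yet it produces the same output up to floating-point rounding.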


To see an implementation of [FlashAttention-2](https://arxiv.org/abs/2307.08691)
as a fused operation, see
[`fused_attention.mojo` on GitHub](https://github.com/modular/modular/blob/main/max/examples/custom_ops/kernels/fused_attention.mojo).

---

## Inference routing

Inference routing is the process of directing incoming inference requests to
the appropriate worker node in a distributed LLM serving cluster. Rather than
simply forwarding requests to the next available worker, an inference router
uses configurable routing strategies to intelligently distribute traffic based
on workload characteristics, hardware state, and caching conditions.

The inference router receives a prompt from an HTTP server, analyzes the
request to extract information relevant to the selected routing strategy,
selects a worker based on the routing algorithm and current cluster state,
proxies the request to that worker, and streams the response back to the user.

<figure>
  
  
  <figcaption>An overview of the steps taken by an inference router to
  select a worker and proxy the response.</figcaption>
</figure>

## Routing strategies

| Name           | Strategy                                                             | Use case                                                     |
|----------------|----------------------------------------------------------------------|--------------------------------------------------------------|
| KV cache-aware | Routes based on shared tokens or document chunks in the KV cache     | Repeated prompts in chatbots, agents, or RAG-style workflows |
| Least request  | Sends requests to the worker with the fewest active requests         | Mixed workloads with variable size or latency requirements   |
| Prefix-aware   | Uses consistent hashing on prompt prefixes to group similar requests | Prompts with shared templates or recurring task descriptions |
| Random         | Selects a backend worker at random                                   | Benchmarking and exposing latency variability                |
| Round robin    | Distributes requests evenly across all workers in sequential order   | Stateless, uniform tasks without caching needs               |
| Sticky session | Routes requests with the same session ID to the same worker          | Session-based chat or apps needing memory and continuity     |

### KV cache-aware

KV cache-aware routing manages requests based on the contents of the
[KV cache](kv-cache.mdx) on each worker. It is most useful for
retrieval-augmented generation (RAG) systems where many queries share common
document chunks or similar inputs, but not identical prefixes. KV cache-aware
routing is especially useful for high-throughput workloads with many repeating
or similar tokens across queries.

### Least request

Least request routing sends new inference requests to the worker currently
handling the fewest active requests. This helps balance load dynamically and
reduces the chance of overloading any single worker. It is especially useful
for variable-length or unpredictable inference tasks and workloads where you
want to minimize tail latency.
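
A minimal sketch of the selection rule, assuming the router tracks a count of active requests per worker (the worker names and counts are made up):

```python
def pick_least_request(active_requests):
    """Return the worker with the fewest active requests (first one on ties)."""
    return min(active_requests, key=active_requests.get)


active = {"worker-0": 4, "worker-1": 1, "worker-2": 3}
chosen = pick_least_request(active)
active[chosen] += 1  # the new request is now in flight on that worker
```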

### Prefix-aware

Prefix-aware routing hashes the prompt prefix of an incoming request (using
consistent hashing) and routes the request to the worker handling requests
with the same prefix. This maximizes prefix cache reuse: for example, if many
users share a common system prompt, that prefix stays cached on a single node.
When a worker becomes saturated for a popular prefix, the router automatically
distributes the load by spilling over to additional workers, maintaining
partial cache locality while balancing traffic.

Prefix-aware routing is especially useful when many users send queries that
start with the same instructions or template, or in multi-turn conversations
where session stickiness isn't enabled.
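
The sketch below shows one way consistent hashing on prompt prefixes could work. It is an assumption-laden illustration rather than any router's actual algorithm: the class name, the character-based prefix length, and the number of ring replicas are all hypothetical, and the spillover behavior described above is not modeled.

```python
import bisect
import hashlib


def _hash(text):
    return int(hashlib.sha256(text.encode()).hexdigest(), 16)


class PrefixRouter:
    def __init__(self, workers, prefix_chars=32, replicas=16):
        self.prefix_chars = prefix_chars
        # Place several points per worker on a hash ring.
        self.ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(replicas)
        )
        self.keys = [point for point, _ in self.ring]

    def route(self, prompt):
        # Hash only the prefix, then walk clockwise to the next ring point.
        point = _hash(prompt[:self.prefix_chars])
        idx = bisect.bisect(self.keys, point) % len(self.ring)
        return self.ring[idx][1]


workers = ["worker-0", "worker-1", "worker-2"]
router = PrefixRouter(workers)
system = "You are a helpful assistant. Answer briefly. "
first = router.route(system + "What is a GPU?")
second = router.route(system + "Define occupancy.")
```

Because both prompts share the same leading characters, they hash to the same ring position and land on the same worker.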

### Random

Random routing selects a backend worker at random from the pool of available
endpoints for each incoming request. It is most useful for benchmarking: by
eliminating routing bias, it exposes average worker performance under
distributed load and helps identify latency variability across nodes.

### Round robin

Round robin routing distributes incoming requests evenly across all available
workers in sequential order, cycling back to the first worker after reaching
the last. It is well-suited for stateless or homogeneous workloads where each
request is independent and caching is not a concern.

### Sticky session

Sticky session routing sends a user's requests to the same worker node for
the duration of their session, identified by a session ID in the HTTP request
header. If no session header is present, the router falls back to round robin.
This strategy is most useful for chatbots or streaming applications where
in-flight session state is maintained on the server and continuity across
requests matters.
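
A minimal sketch of this behavior follows. The header name `X-Session-ID` is hypothetical, and assigning each new session to the next round-robin worker is an assumption made for illustration.

```python
import itertools


class StickyRouter:
    def __init__(self, workers):
        self._round_robin = itertools.cycle(workers)
        self._sessions = {}  # session ID -> assigned worker

    def route(self, headers):
        session_id = headers.get("X-Session-ID")  # hypothetical header name
        if session_id is None:
            # No session header: fall back to round robin.
            return next(self._round_robin)
        if session_id not in self._sessions:
            self._sessions[session_id] = next(self._round_robin)
        return self._sessions[session_id]


router = StickyRouter(["worker-0", "worker-1", "worker-2"])
first = router.route({"X-Session-ID": "abc"})
second = router.route({})
third = router.route({"X-Session-ID": "abc"})
```

The first and third requests share a session ID and reach the same worker, while the request without a header takes the next worker in the rotation.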

## Relation to KV cache and prefix caching

Several routing strategies, particularly prefix-aware and KV cache-aware
routing, are designed to maximize the value of the [KV cache](kv-cache.mdx).
By routing requests with shared prompt prefixes to the same worker, these
strategies reduce redundant computation and improve throughput. See
[prefix caching](/max/serve/prefix-caching) for more on how caching works at
the serving layer.

---

## KV cache

KV (key-value) cache is a memory structure used in
[transformer](transformer.mdx) models to store the key and value tensors
produced by [self-attention](attention.mdx#self-attention) layers. The KV cache
speeds up inference for transformer models such as large language models (LLMs)
by avoiding the need to recompute the keys and values for all previous tokens
in a sequence.

For example, suppose an LLM is trying to complete the sentence, "The quick
brown fox..." After the model predicts "jumps" and then begins to predict the
next token, the model needs the keys and values for every token in the
sequence so far (including the one it just predicted). That is, for each step
in the [autoregression](autoregression.mdx) cycle, it must process the entire
sequence thus far:

1. "The quick brown fox..."
2. "The quick brown fox jumps..."
3. "The quick brown fox jumps over..."

And so on.

By storing the already-computed key and value vectors for previous tokens in
the KV cache, the model simply reads them at each step instead of recomputing
them. Once the model predicts the next token, it computes that token's key and
value vectors and adds them to the KV cache.
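
The sketch below counts how much work the cache saves. The `project` function is a hypothetical stand-in for a model's key and value projections, and the token IDs are arbitrary.

```python
class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def project(token):
    """Hypothetical stand-in for computing one token's key and value vectors."""
    project.calls += 1
    return (float(token), 1.0), (float(token) * 2, 0.0)


project.calls = 0

cache = KVCache()
for token in [3, 1, 4]:   # prefill: project every prompt token once
    cache.append(*project(token))
for token in [1, 5]:      # decode: project only the newly generated token
    cache.append(*project(token))
```

With the cache, five tokens require five projections. Reprocessing the whole sequence at every step would instead take 3 + 4 + 5 = 12.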

As the sequence length grows during inference (as more words are generated),
the KV cache becomes the dominant factor in a model's memory usage. The
sequence length is always limited by the model's total context window length,
which varies across models and can usually be configured.
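
A back-of-the-envelope estimate shows how quickly the cache grows. The model dimensions below are hypothetical, and the formula assumes one key tensor and one value tensor per layer with no compression or sharing across heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes (bytes_per_value=2 assumes 16-bit floats)."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gib = size / 2**30
```

For these assumed dimensions, a single 4096-token sequence needs 2 GiB of cache, and the size scales linearly with both sequence length and batch size.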

---

## Padding tokens

Padding tokens are extra tokens (usually zeros or special tokens) that are
added to the input for a model so that the input matches the model's fixed
input length or to ensure that all sequences in a [batch](batching.mdx) have
the same length.
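
For example, a batch could be padded to its longest sequence like this (the helper name and the padding ID `0` are assumptions for illustration):

```python
PAD_ID = 0  # hypothetical padding token ID


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


padded = pad_batch([[5, 9, 2], [7, 3, 8, 4, 6], [1]])
```

In this batch, 6 of the 15 stored positions are padding, which is the wasted work that padding-free approaches try to avoid.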

In [transformer](transformer.mdx) models, padding tokens have been mostly
replaced with [ragged tensors](ragged-tensors.mdx).

---

## PagedAttention

PagedAttention is a memory management technique designed to improve GPU memory
utilization during large language model (LLM) serving. Inspired by classical
virtual memory and paging methods used in operating systems, PagedAttention
divides the [KV cache](kv-cache.mdx) into fixed-size blocks, which are not
necessarily stored contiguously in memory. This approach enables more efficient
handling of dynamic states in LLMs, allowing the model to manage large context
sizes while optimizing memory usage, as described in the paper [Efficient
Memory Management for Large Language Model Serving with
PagedAttention](https://arxiv.org/abs/2309.06180) (Kwon et al., 2023).

Also written as "paged attention."
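
The bookkeeping behind this idea resembles a page table. The sketch below tracks only block assignments, not the key and value data itself, and the class and method names are hypothetical. It also assumes free blocks never run out.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence ID -> list of physical block IDs
        self.lengths = {}       # sequence ID -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical block, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # First token, or the current block is full: take a free block.
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token(seq) for seq in ["a", "b", "a", "a"]]
```

Because sequences `a` and `b` grow in an interleaved order, sequence `a` ends up in blocks 0 and 2, which are not adjacent in memory.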

---

## Ragged tensors

Ragged tensors are a way to batch multiple requests with differing
sequence lengths without the need for [padding tokens](padding-tokens.mdx).
Ragged tensors allow sequences of variable lengths to be processed together
efficiently by storing them in a compact, non-uniform format.

Also sometimes referred to as "packed tensors."
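
One common layout, shown here as a sketch, stores all tokens in a single flat list alongside a list of row offsets. The function names are assumptions for illustration.

```python
def pack(sequences):
    """Concatenate sequences and record where each one starts and ends."""
    values, offsets = [], [0]
    for seq in sequences:
        values.extend(seq)
        offsets.append(len(values))
    return values, offsets


def row(values, offsets, i):
    """Recover sequence i from the packed representation."""
    return values[offsets[i]:offsets[i + 1]]


values, offsets = pack([[5, 9, 2], [7, 3, 8, 4, 6], [1]])
second = row(values, offsets, 1)
```

The packed form stores exactly 9 tokens with no padding, plus 4 offsets.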

---

## Tokenization

Tokenization is the process of dividing the input for an AI model into discrete
units that have numerical IDs called tokens. Depending on what the input is
(such as text, audio, or an image) the tokens might be based on different words
or subwords in text, or different slices/blocks of pixels in images.

For example, consider the sentence, "The cat sat on the mat." A word-level
tokenization might split this sentence into the following words: "The," "cat,"
"sat," "on," "the," "mat." Then it replaces each word with a token (a number).
The token "vocabulary"—the mapping of words to numbers—is predetermined and may
vary from model to model.
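
A toy word-level tokenizer for this sentence might look like the following. The vocabulary is invented, and lowercasing the input and mapping unknown words to an `<unk>` token are simplifying assumptions.

```python
import re

# Hypothetical vocabulary: the mapping of words to token IDs.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}


def tokenize(text):
    """Split text into lowercase words and map each one to its token ID."""
    words = re.findall(r"[a-z]+", text.lower())
    return [vocab.get(word, vocab["<unk>"]) for word in words]


ids = tokenize("The cat sat on the mat.")
unknown = tokenize("The dog sat")
```

The word "dog" is not in the vocabulary, so it falls back to the `<unk>` ID, which illustrates why a pure word-level scheme struggles with unseen words.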

But tokenizers in large language models (LLMs) are much more sophisticated than
that. Among other things, they also tokenize punctuation (or combinations of
words and punctuation) and break words into subwords that allow them to
tokenize words they've never seen before.

Because LLMs are trained on these tokens, they don't actually understand words
and letters the way we do. They can only recognize and generate information
based on the token vocabulary that they were trained upon. (Popular LLMs have a
token vocabulary with over 100,000 tokens.)

---

## Transformer

A transformer is a neural network architecture designed to perform complex
tasks with sequential data (such as text, speech, and images) in a manner that
can be efficiently parallelized on GPUs or other accelerator hardware. This
makes transformers highly effective for natural language processing and other
generative AI (GenAI) applications.

The transformer model architecture was first introduced in the paper
[Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al.,
2017). This design emphasizes the use of
[self-attention](attention.mdx#self-attention) mechanisms instead of recurrent
structures like recurrent neural networks (RNNs) or long short-term memory
networks (LSTMs), which is what allows for the processing to be parallelized
across separate compute cores instead of requiring the model to process the
sequence one step at a time. This design is currently the foundation for all major
large language models (LLMs) such as GPT, Llama, Gemini, DeepSeek, and more.

---

## Block index

In GPU programming, a block index uniquely identifies a subset of
[threads](thread.mdx) that execute a [kernel](kernel.mdx) function on the GPU.
Threads are grouped into units called [blocks](thread-block.mdx), and multiple
blocks together form a larger structure known as a [grid](grid.mdx).

Each block within the grid is assigned a unique block index, which can be
represented across one, two, or three dimensions. This allows for flexible
organization of threads to match the structure of the problem being solved.
Within each block, individual threads have their own [thread
index](thread-index.mdx), which, together with the block index, determines which
part of the problem each thread should work on. This hierarchical structure of
grids, blocks, and threads enables efficient workload distribution across the
many processing cores of the GPU, maximizing parallel performance.

Because a programmer can arrange thread blocks within a grid across one, two,
or three dimensions, a block index is a 3-element vector of x, y, and z
coordinates. For 2-dimensional arrangements, the z coordinate of all block
indices is 0, and for 1-dimensional arrangements, both the y and z coordinates
of all block indices are 0.

---

## Grid

A grid is the top-level organizational structure of the threads executing a
[kernel](kernel.mdx) function on a GPU. A grid consists of multiple [thread
blocks](thread-block.mdx) (also known as *workgroups* on AMD GPUs), which are
further divided into individual [threads](thread.mdx) (or *work-items* on AMD
GPUs) that execute the kernel function concurrently.

The division of a grid into thread blocks serves multiple crucial purposes:

- First, it breaks down the overall workload—managed by the grid—into
  smaller, more manageable portions that can be processed independently. This
  division allows for better resource utilization and scheduling flexibility
  across multiple [streaming multiprocessors](streaming-multiprocessor.mdx)
  (SMs) in the GPU (or *compute units* on AMD GPUs).

- Second, thread blocks provide a scope for threads to collaborate through
  shared memory and synchronization primitives, enabling efficient parallel
  algorithms and data sharing patterns.

- Finally, thread blocks help with scalability by allowing the same program to
  run efficiently across different GPU architectures, as the hardware can
  automatically distribute blocks based on available resources.

The programmer specifies the number of thread blocks in a grid and how they are
arranged across one, two, or three dimensions. Typically, the programmer
determines the dimensions of the grid based on the dimensionality of the data to
process. For example, a programmer might choose a 1-dimensional grid for
processing large vectors, a 2-dimensional grid for processing matrices, and a
3-dimensional grid for processing the frames of a video. Each block within the
grid is assigned a unique [block index](block-index.mdx) that determines its
position within the grid.

Similarly, the programmer also specifies the number of threads per thread block
and how they are arranged across one, two, or three dimensions. Each thread
within a block is assigned a unique [thread index](thread-index.mdx) that
determines its position within the block. The combination of block index and
thread index uniquely identifies the position of a thread within the overall grid.
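
The 1-dimensional case can be simulated in plain Python to show how the two indices combine. This is a sequential sketch of what a GPU does in parallel, and the function name is hypothetical.

```python
def launch_1d(num_elements, block_dim):
    """Simulate a 1-D grid: return the grid size and the elements covered."""
    # Enough blocks to cover every element (ceiling division).
    grid_dim = (num_elements + block_dim - 1) // block_dim
    covered = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Block index and thread index together give a unique position.
            global_idx = block_idx * block_dim + thread_idx
            if global_idx < num_elements:  # guard threads past the end
                covered.append(global_idx)
    return grid_dim, covered


grid_dim, covered = launch_1d(num_elements=10, block_dim=4)
```

Three blocks of four threads cover the ten elements, and the bounds check keeps the last two threads of the final block idle.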

---

## Kernel

A kernel is a function that runs on a GPU, executing computations in parallel
across a large number of [threads](thread.mdx). Kernels are a fundamental
part of general-purpose GPU (GPGPU) programming and are designed to process
large datasets efficiently by performing the same operation simultaneously on
multiple data elements.

---

## GPU memory

GPU memory consists of both on-chip memory and external dynamic random-access
memory (DRAM), often referred to as *device memory* (in contrast to the *host
memory* used by the CPU).

On-chip memory includes:

- A register file for each [streaming
  multiprocessor](streaming-multiprocessor.mdx) (SM), containing the
  [registers](register.mdx) used by threads executing on the SM's cores

- An L1 cache for each SM to cache reads from global memory

- Shared memory for each SM, containing data explicitly shared between the
  threads of a given [thread block](thread-block.mdx) executing on the SM

- A read-only constant cache for each SM, which caches data read from the
  constant memory space in global memory

- An L2 cache shared by all SMs that is used to cache accesses to local or
  global memory, including temporary register spills

Device memory includes:

- Global memory, which contains data accessible to all threads

- Constant memory, which contains data explicitly identified as read-only by the
  programmer, and which is accessible to all threads

- Local memory, which contains data private to an individual thread, such as
  statically allocated arrays, spilled registers, and other elements of the
  thread's call stack

Data in global memory persists until explicitly freed, even across separate
[kernel](kernel.mdx) launches. This means that one kernel can write data to
global memory and a subsequent kernel can read that data.

---

## Occupancy

In GPU programming, occupancy is a measure of how fully a GPU's compute
resources are being used. It is defined as the ratio of the number of active
[warps](warp.mdx) to the maximum number of warps that can be active on a given
[streaming multiprocessor](streaming-multiprocessor.mdx) (SM) at any one time.

Higher occupancy can improve parallel execution and hide memory latency, but
increasing occupancy does not always boost performance, as factors like memory
bandwidth and instruction dependencies may create bottlenecks. The optimal
occupancy level depends on the workload and GPU architecture.
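
The ratio in the definition can be written out directly. The following is a
minimal sketch; the limit of 64 warps per SM used in the example is an
illustrative assumption, not a property of any particular GPU.

```python
def occupancy(active_warps: int, max_warps_per_sm: int) -> float:
    """Return active warps as a fraction of the SM's maximum resident warps."""
    if max_warps_per_sm <= 0:
        raise ValueError("max_warps_per_sm must be positive")
    if not 0 <= active_warps <= max_warps_per_sm:
        raise ValueError("active_warps must be between 0 and max_warps_per_sm")
    return active_warps / max_warps_per_sm


# Assumed example: 32 active warps on an SM that can hold at most 64.
print(occupancy(32, 64))  # 0.5
```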

---

## Register

A GPU register is the fastest form of storage within a [streaming
multiprocessor](streaming-multiprocessor.mdx) (SM). Registers store integer and
floating point values used frequently by a [thread](thread.mdx), reducing
reliance on slower [memory](memory.mdx) types (shared, global, or local
memory).

Registers are located within an SM in what is referred to as a *register file*.
The number of registers depends on the GPU architecture, but modern GPUs
typically provide tens of thousands of 32-bit registers per SM.

For each thread that it executes, the SM allocates a set of registers for the
private use of that thread. The registers are associated with that thread
throughout its lifetime, even if the thread is not currently executing on the
SM's cores (for example, if it is blocked waiting for data from memory). A
thread can't access registers assigned to a different thread, preventing data
conflicts between threads. If a [kernel](kernel.mdx) function requires more
registers per thread than are available, the compiler spills some register data
to the thread's local [memory](memory.mdx). Because
local memory access is slower than register access, programmers should try to
design their kernels to avoid or limit the amount of spill.
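
Because every thread gets its own registers, per-thread register use also
bounds how many warps can have their registers held in the register file at
once. The sketch below works through that arithmetic under simplifying
assumptions: a register file of 65,536 registers per SM, a warp size of 32, and
no allocation granularity or other hardware limits. Real GPUs apply additional
rules, so treat the numbers as illustrative only.

```python
def register_limited_warps(
    registers_per_sm: int, registers_per_thread: int, warp_size: int = 32
) -> int:
    """Return how many warps fit in the register file, ignoring other limits."""
    if registers_per_thread <= 0 or warp_size <= 0:
        raise ValueError("registers_per_thread and warp_size must be positive")
    registers_per_warp = registers_per_thread * warp_size
    return registers_per_sm // registers_per_warp


# Assumed register file size of 65,536 registers per SM.
print(register_limited_warps(65536, 32))   # 64
print(register_limited_warps(65536, 128))  # 16
```

In this simplified model, quadrupling the registers each thread uses cuts the
number of warps that fit by a factor of four.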

---

## Streaming multiprocessor

The basic building block of a GPU is called a *streaming multiprocessor* (SM)
on NVIDIA GPUs or a *compute unit* (CU) on AMD GPUs (they're the same idea and
we'll refer to them both as SM). SMs sit between the high-level GPU control
logic and the individual execution units, acting as self-contained processing
factories that can operate independently and in parallel.

Multiple SMs are arranged on a single GPU chip, with each SM capable of handling
multiple workloads simultaneously. The GPU's global scheduler assigns work to
individual SMs, while the memory controller manages data flow between the SMs
and the levels of the [memory](memory.mdx) hierarchy (global memory, L2 cache, etc.).

The number of SMs in a GPU can vary significantly based on the model and
intended use case, from a handful in entry-level GPUs to dozens or even hundreds
in high-end professional cards. This scalable design lets the same architecture
serve both small and large workloads by varying the number of SMs.

Each SM contains several essential components:

- **CUDA Cores (NVIDIA)/Stream Processors (AMD):** These are the basic
  arithmetic logic units (ALUs) that perform integer and floating-point
  calculations. A single SM can contain dozens or hundreds of these cores.
- **Tensor Cores (NVIDIA)/Matrix Cores (AMD):** Specialized units optimized for
  matrix multiplication and convolution operations.
- **Special Function Units (SFUs):** Handle complex mathematical operations like
  trigonometry, square roots, and exponential functions.
- **[Register](register.mdx) Files:** Ultra-fast storage that holds intermediate
  results and thread-specific data. Modern SMs can have hundreds of kilobytes of
  register space shared among active [threads](thread.mdx).
- **Shared Memory/L1 Cache:** A programmable, low-latency memory space that
  enables data sharing between threads. This memory is typically configurable
  between shared memory and L1 cache functions.
- **Load/Store Units:** Manage data movement between different memory spaces,
  handling memory access requests from threads.

---

## Thread block

In GPU programming, a thread block (also known as *workgroup* on AMD GPUs) is a
subset of threads within a [grid](grid.mdx), which is the top-level
organizational structure of the [threads](thread.mdx) executing a
[kernel](kernel.mdx) function. As the primary building block for workload
distribution, thread blocks serve multiple crucial purposes:

- First, they break down the overall workload — managed by the grid — of a
  kernel function into smaller, more manageable portions that can be processed
  independently. This division allows for better resource utilization and
  scheduling flexibility across multiple [streaming
  multiprocessors](streaming-multiprocessor.mdx) (SMs) in the GPU.

- Second, thread blocks provide a scope for threads to collaborate through
  shared memory and synchronization primitives, enabling efficient parallel
  algorithms and data sharing patterns.

- Finally, thread blocks help with scalability by allowing the same program to
  run efficiently across different GPU architectures, as the hardware can
  automatically distribute blocks based on available resources.

The programmer specifies the number of thread blocks in a grid and how they are
arranged across one, two, or three dimensions. Each block within the grid is
assigned a unique [block index](block-index.mdx) that determines its position
within the grid. Similarly, the programmer also specifies the number of threads
per thread block and how they are arranged across one, two, or three dimensions.
Each thread within a block is assigned a unique [thread index](thread-index.mdx)
that determines its position within the block.

The GPU assigns each thread block within the grid to a streaming multiprocessor
(SM) for execution. The SM groups the threads within a block into fixed-size
subsets called [warps](warp.mdx), consisting of either 32 or 64 threads each
depending on the particular GPU architecture. The SM's warp scheduler manages
the execution of warps on the SM's cores.

Threads within a block can share data through [shared memory](memory.mdx)
and synchronize using built-in mechanisms, but they cannot directly communicate
with threads in other blocks.

---

## Thread index

In GPU programming, a thread index uniquely identifies the position of a
[thread](thread.mdx) within a particular [thread block](thread-block.mdx)
executing a [kernel](kernel.mdx) function on the GPU. A thread block is a subset
of threads in a [grid](grid.mdx), which is the top-level organizational
structure of the threads executing a kernel function. Each block within the grid
is also assigned a unique block index, which identifies the block's position
within the grid. The combination of block index and thread index uniquely
identifies the thread's overall position within the grid, and is used to
determine which part of the problem each thread should work on.

Because a programmer can arrange threads within a thread block across one, two,
or three dimensions, a thread index is a 3-element vector of x, y, and z
coordinates. For 2-dimensional arrangements, the z coordinate of all thread
indices is 0, and for 1-dimensional arrangements, both the y and z coordinates
of all thread indices are 0.
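
To make the 3-element form concrete, this Python sketch enumerates the thread
indices of a block from its dimensions. The helper name `thread_indices` and
the choice to let x vary fastest are assumptions for illustration.

```python
from itertools import product


def thread_indices(block_dim):
    """List every (x, y, z) thread index in a block, with x varying fastest."""
    dx, dy, dz = block_dim
    return [(x, y, z) for z, y, x in product(range(dz), range(dy), range(dx))]


# A 2-dimensional block of 4 x 2 threads: every z coordinate is 0.
idx_2d = thread_indices((4, 2, 1))
print(idx_2d[:5])  # [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (0, 1, 0)]

# A 1-dimensional block of 8 threads: every y and z coordinate is 0.
idx_1d = thread_indices((8, 1, 1))
print(idx_1d[-1])  # (7, 0, 0)
```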

---

## Thread

In GPU programming, a thread (also known as a *work-item* on AMD GPUs) is the
smallest unit of execution within a [kernel](kernel.mdx) function. Threads are
grouped into [thread blocks](thread-block.mdx) (or *workgroups* on AMD GPUs),
which are further organized into a [grid](grid.mdx).

The programmer specifies the number of thread blocks in a grid and how they are
arranged across one, two, or three dimensions. Each block within the grid is
assigned a unique [block index](block-index.mdx) that determines its position
within the grid. Similarly, the programmer also specifies the number of threads
per thread block and how they are arranged across one, two, or three dimensions.
Each thread within a block is assigned a unique [thread index](thread-index.mdx)
that determines its position within the block.

The GPU assigns each thread block within the grid to a [streaming
multiprocessor](streaming-multiprocessor.mdx) (SM) for execution. The SM groups
the threads within a block into fixed-size subsets called [warps](warp.mdx),
consisting of either 32 or 64 threads each depending on the particular GPU
architecture. The SM's warp scheduler manages the execution of warps on the SM's
cores.

The SM allocates a set of [registers](register.mdx) for each thread to store
and process values private to that thread. The registers are associated with
that thread throughout its lifetime, even if the thread is not currently
executing on the SM's cores (for example, if it is blocked waiting for data from
memory). Each thread also has access to [local memory](memory.mdx) to store
statically allocated arrays, spilled registers, and other elements of the
thread's call stack.

Threads within a block can share data through shared memory and synchronize
using built-in mechanisms, but they cannot directly communicate with threads in
other blocks.

---

## Warp

In GPU programming, a warp (also known as a *wavefront* on AMD GPUs) is a subset
of [threads](thread.mdx) from a [thread block](thread-block.mdx) that execute
together in lockstep. When a GPU assigns a thread block to execute on a
[streaming multiprocessor](streaming-multiprocessor.mdx) (SM), the SM divides
the thread block into warps of 32 or 64 threads, with the exact size depending
on the GPU architecture.

If a thread block contains a number of threads not evenly divisible by the warp
size, the SM creates a partially filled final warp that still consumes the full
warp's resources. For example, if a thread block has 100 threads and the warp
size is 32, the SM creates:

- 3 full warps of 32 threads each (96 threads total)

- 1 partial warp with only 4 active threads but still occupying a full warp's
  worth of resources (32 thread slots)

The SM effectively disables the unused thread slots in partial warps, but these
slots still consume hardware resources. For this reason, developers generally
should make thread block sizes a multiple of the warp size to optimize resource
usage.
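
The warp count in the example above is a ceiling division. Here is a small
Python sketch of that arithmetic; the function name `warps_for_block` is made
up for this illustration.

```python
def warps_for_block(threads_per_block: int, warp_size: int = 32):
    """Return (total warps, idle thread slots) for one thread block."""
    if threads_per_block <= 0 or warp_size <= 0:
        raise ValueError("threads_per_block and warp_size must be positive")
    full_warps, remainder = divmod(threads_per_block, warp_size)
    total_warps = full_warps + (1 if remainder else 0)
    idle_slots = total_warps * warp_size - threads_per_block
    return total_warps, idle_slots


print(warps_for_block(100))  # (4, 28): 3 full warps plus 1 partial warp
print(warps_for_block(128))  # (4, 0): a multiple of the warp size wastes nothing
```

Both block sizes occupy four warps, but the 100-thread block leaves 28 thread
slots idle.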

Each thread in a warp executes the same instruction at the same time on
different data, following the single instruction, multiple threads (SIMT)
execution model. If threads within a warp take different execution paths (called
*warp divergence*), the warp serially executes each branch path taken, disabling
threads that are not on that path. This means that optimal performance is
achieved when all threads in a warp follow the same execution path.
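
The cost of divergence can be sketched by emulating a warp in Python. This is a
simplified model, not how any specific GPU implements branching: each branch
path that at least one thread takes costs one pass over the warp, and threads
not on that path are masked off during the pass. The branch itself (halve even
values, otherwise compute `3 * v + 1`) is an arbitrary example.

```python
def run_warp_with_divergence(values):
    """Emulate a warp running: if v is even, v // 2; otherwise 3 * v + 1.

    Returns the per-thread results and the number of serialized passes.
    """
    on_even_path = [v % 2 == 0 for v in values]
    results = list(values)
    passes = 0

    # Pass for the "even" path: only threads on this path are active.
    if any(on_even_path):
        passes += 1
        for lane, active in enumerate(on_even_path):
            if active:
                results[lane] = values[lane] // 2

    # Pass for the "odd" path: the remaining threads are active.
    if not all(on_even_path):
        passes += 1
        for lane, active in enumerate(on_even_path):
            if not active:
                results[lane] = 3 * values[lane] + 1

    return results, passes


print(run_warp_with_divergence([2, 4, 6, 8]))  # ([1, 2, 3, 4], 1)
print(run_warp_with_divergence([1, 2, 3, 4]))  # ([4, 1, 10, 2], 2)
```

When every thread takes the same path, the warp needs one pass. Mixed inputs
need two, which is the serialization the paragraph above describes.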

An SM can actively manage multiple warps from different thread blocks
simultaneously, helping keep execution units busy. For example, the warp
scheduler can quickly switch to another ready warp if the current warp's threads
must wait for memory access.

Warps deliver several key performance advantages:

- The hardware needs to manage only warps instead of individual threads,
  reducing scheduling overhead

- Threads in a warp can access contiguous memory locations efficiently through
  memory coalescing

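- Threads within a warp execute in lockstep, which often reduces the need for
  explicit synchronization among them (some architectures schedule threads
  independently, so warp-level synchronization can still be required)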

- The warp scheduler can hide memory latency by switching between warps,
  maximizing compute resource utilization