Python module

max.nn.attention

The attention mechanism used within the model.

Attention layers

AttentionWithRope: Implementation of attention that uses Rotary Position Embedding (RoPE).
DistributedAttentionImpl: A generalized distributed attention interface.
GGUFQAttentionWithRope: Implementation of attention with GGUF quantized weights.
GPTQAttentionWithRope: Implementation of the GPTQ attention layer.
LatentAttentionWithRope: Implementation of latent attention with RoPE.
MultiheadAttention: Multihead attention that handles both single-device and distributed computation.
RaggedAttention: Layer that computes the self-attention score for ragged inputs.
TensorParallelAttentionWithRope: Tensor-parallel wrapper that delegates sharding to the base module.
TensorParallelLatentAttentionWithRope: Distributed tensor-parallel implementation of latent attention with RoPE.
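
Several of the layers above are built around RoPE. As a rough illustration of what rotary position embedding does (a conceptual sketch, not MAX's actual implementation or API), RoPE rotates consecutive pairs of query/key components by a position-dependent angle, so relative positions become visible to the attention dot product:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Illustrative RoPE: rotate pairs (vec[2i], vec[2i+1]) by an angle
    that depends on the token position `pos` and the pair index."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)   # lower frequencies for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

# Position 0 leaves the vector unchanged (all rotation angles are zero).
print(rope_rotate([1.0, 0.0, 1.0, 0.0], pos=0))  # → [1.0, 0.0, 1.0, 0.0]
```

Because each step is a pure rotation, the vector's norm is preserved at every position, which is why RoPE can be applied without rescaling queries or keys.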

Mask configuration

AttentionMaskVariant: Defines the string mask variant identifiers used in attention configuration.
MHAMaskVariant: Defines the integer mask variant codes used by multihead attention kernels.
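
As a plain-Python sketch of what a mask variant selects (illustrative only; the variant names and MAX's kernel-side representation are assumptions, not the library's API), a causal variant restricts each query position to itself and earlier key positions, while a full/null variant allows attention everywhere:

```python
def causal_mask(seq_len):
    """Boolean causal mask: entry [q][k] is True where attention is allowed,
    i.e. key position k is at or before query position q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

def full_mask(seq_len):
    """Unrestricted variant: every position may attend to every other."""
    return [[True] * seq_len for _ in range(seq_len)]

print(causal_mask(3))
# → [[True, False, False], [True, True, False], [True, True, True]]
```

In practice a kernel consumes the variant code rather than a materialized boolean matrix, but the allowed-position semantics are the same.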

Functions

num_heads_for_device: Computes the number of attention heads assigned to a specific device.
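
A minimal sketch of one way such a head assignment can work (a hypothetical near-even split, a common tensor-parallel layout; the signature and remainder rule here are assumptions, not necessarily MAX's exact behavior): divide the heads evenly and give the first `total_heads % num_devices` devices one extra head each.

```python
def heads_for_device(total_heads, device_idx, num_devices):
    """Hypothetical near-even partition of attention heads across devices.
    The first (total_heads % num_devices) devices each take one extra head."""
    base, rem = divmod(total_heads, num_devices)
    return base + (1 if device_idx < rem else 0)

# 10 heads over 4 devices:
print([heads_for_device(10, d, 4) for d in range(4)])  # → [3, 3, 2, 2]
```

Whatever the exact rule, the per-device counts must sum to the total head count so no head is dropped or duplicated across the tensor-parallel group.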