Python module

# max.nn.attention

The attention mechanisms used within the model.
## Attention layers

| AttentionWithRope | Implementation of attention that uses Rotary Position Embedding (RoPE). |
|---|---|
| DistributedAttentionImpl | A generalized distributed attention interface. |
| GGUFQAttentionWithRope | Implementation of attention with GGUF quantized weights. |
| GPTQAttentionWithRope | Implementation of the GPTQ attention layer. |
| LatentAttentionWithRope | Implementation of latent attention with RoPE. |
| MultiheadAttention | Multihead attention that handles both single-device and distributed computation. |
| RaggedAttention | Layer that computes the self-attention score for ragged inputs. |
| TensorParallelAttentionWithRope | Tensor-parallel wrapper that delegates sharding to the base module. |
| TensorParallelLatentAttentionWithRope | Distributed tensor-parallel implementation of latent attention with RoPE. |
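Several of the layers above apply Rotary Position Embedding (RoPE), which encodes token position by rotating pairs of query/key dimensions by position-dependent angles. The sketch below illustrates that rotation in plain Python; it is a conceptual example only, not the MAX implementation, and the function name `rope_rotate` is made up for illustration.

```python
import math

def rope_rotate(x, position, base=10000.0):
    """Rotate a vector by position-dependent angles, RoPE-style.

    Illustrative sketch (not the MAX API): each dimension pair
    (2i, 2i+1) is rotated by the angle position / base**(2i / d),
    so relative position is encoded in the dot product of rotated
    queries and keys.
    """
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        theta = position / (base ** (2 * i / d))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out
```

Because each pair is rotated by a pure rotation, the transform preserves vector norms, and position 0 leaves the vector unchanged.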
## Mask configuration

| AttentionMaskVariant | Defines the string mask variant identifiers used in attention configuration. |
|---|---|
| MHAMaskVariant | Defines the integer mask variant codes used by multihead attention kernels. |
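A mask variant selects which key positions each query position may attend to. The most common choice for decoder models is a causal mask, sketched below in plain Python as a conceptual illustration; the actual MAX kernels select the variant via the identifiers above rather than materializing a boolean matrix like this.

```python
def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend only to j <= i.

    Conceptual sketch of the causal masking rule, not the MAX kernel
    implementation. True means "attention allowed".
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
```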
## Functions

| num_heads_for_device | Computes the number of attention heads assigned to a specific device. |
|---|---|
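Assigning heads to devices is a partitioning problem: the total head count is split across devices, with any remainder spread over the first few. The sketch below shows one plausible version of that arithmetic; it is an assumption for illustration, and the real `num_heads_for_device` signature and remainder policy may differ.

```python
def num_heads_for_device(total_heads, device_index, num_devices):
    """Evenly partition attention heads across devices.

    Hypothetical sketch of tensor-parallel head assignment, not the
    MAX implementation. When total_heads is not divisible by
    num_devices, earlier devices each take one extra head.
    """
    base, remainder = divmod(total_heads, num_devices)
    return base + (1 if device_index < remainder else 0)
```

For example, 10 heads over 4 devices would be assigned as 3, 3, 2, 2 under this policy, and the per-device counts always sum back to the total.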