
Python class

GGUFQAttentionWithRope

class max.nn.attention.GGUFQAttentionWithRope(*, rope, num_attention_heads, num_key_value_heads, hidden_size, kv_params, dtype, quantization_encoding, devices=None, linear_cls=<class 'max.nn.linear.Linear'>, scale=None, has_bias=False, clip_qkv=None, mask_variant=MHAMaskVariant.CAUSAL_MASK)

Bases: AttentionWithRope

Implementation of attention with GGUF quantized weights.

Initializes the GGUF attention layer.

Parameters:

  • rope (RotaryEmbedding) – The RoPE layer from which to borrow the freqs_cis value.
  • num_attention_heads (int) – The number of attention heads.
  • num_key_value_heads (int) – The number of key/value heads.
  • hidden_size (int) – The dimension of the hidden states.
  • kv_params (KVCacheParams) – KV cache parameters, including the number of KV heads, the head dimension, and the dtype.
  • layer_idx – The layer number associated with this Attention block.
  • dtype (DType) – DType of the weights; should always be uint8.
  • devices (list[DeviceRef] | None) – Device(s) on which to place the weights and run the computation. If multiple are provided, the first device is used. Use TensorParallelAttentionWithRope to use all devices during attention computation.
  • quantization_encoding (QuantizationEncoding) – Quantization encoding of the weights.
  • linear_cls (Callable[..., Linear]) – Linear class to use for the output dense layer.
  • scale (float | None) – Value used to scale the results of the attention output.
  • has_bias (bool) – Whether to use an attention bias.
  • clip_qkv (float | None) – If provided, the QKV weights are clamped to the range [-clip_qkv, clip_qkv].
  • mask_variant (MHAMaskVariant) – Attention mask used by the flash-attention kernel. Defaults to MHAMaskVariant.CAUSAL_MASK.
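
The relationship between the head-count parameters and the clamping described for clip_qkv can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names qkv_output_dims and clamp_values are hypothetical, and the formula head_dim = hidden_size // num_attention_heads is a common convention assumed here (in practice the head dimension is carried by kv_params).

```python
def qkv_output_dims(hidden_size, num_attention_heads, num_key_value_heads):
    """Return (head_dim, q_dim, kv_dim) for one attention layer.

    Assumes head_dim = hidden_size // num_attention_heads, a common
    convention that this page does not confirm.
    """
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    head_dim = hidden_size // num_attention_heads
    q_dim = num_attention_heads * head_dim      # width of the query projection
    kv_dim = num_key_value_heads * head_dim     # width of each of the key and value projections
    return head_dim, q_dim, kv_dim


def clamp_values(values, clip_qkv=None):
    """Clamp each value to [-clip_qkv, clip_qkv]; pass values through when clip_qkv is None."""
    if clip_qkv is None:
        return list(values)
    return [max(-clip_qkv, min(clip_qkv, v)) for v in values]


# Example: 32 query heads sharing 8 key/value heads (grouped-query attention).
head_dim, q_dim, kv_dim = qkv_output_dims(4096, 32, 8)
print(head_dim, q_dim, kv_dim)  # 128 4096 1024
print(clamp_values([-3.0, 0.5, 9.0], clip_qkv=2.0))  # [-2.0, 0.5, 2.0]
```

When num_key_value_heads is smaller than num_attention_heads, the key and value projections are narrower than the query projection, which is why the two counts are separate parameters.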

rope

rope: RotaryEmbedding

wqkv

property wqkv: TensorValue

The concatenation of q, k, and v weight vectors.
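
As a rough illustration of what "concatenation" means here, the sketch below stacks three unquantized matrices along the output (row) axis using nested lists. The helper concat_qkv is hypothetical, and the row-wise axis is an assumption. The real property returns a TensorValue over GGUF-packed uint8 data, whose byte layout is not modeled.

```python
def concat_qkv(wq, wk, wv):
    """Stack three row-major matrices along the output (row) axis."""
    widths = {len(row) for matrix in (wq, wk, wv) for row in matrix}
    if len(widths) != 1:
        raise ValueError("q, k, and v must share the same input width")
    # Rows of q come first, then k, then v; copies keep the inputs untouched.
    return [list(row) for matrix in (wq, wk, wv) for row in matrix]


hidden_size = 4
wq = [[1.0] * hidden_size for _ in range(4)]  # 4 query output rows
wk = [[2.0] * hidden_size for _ in range(2)]  # 2 key output rows
wv = [[3.0] * hidden_size for _ in range(2)]  # 2 value output rows

wqkv = concat_qkv(wq, wk, wv)
print(len(wqkv), len(wqkv[0]))  # 8 4
```

Fusing the three projections this way lets a single matrix multiplication produce q, k, and v together, after which the result is split by row ranges.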

wqkv_bias

property wqkv_bias: TensorValue | None

The concatenation of q, k, and v bias weight vectors.
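
The optional return type suggests that no bias tensor exists when the layer is built without one. The sketch below models that with plain lists. The helper concat_qkv_bias is hypothetical, and tying the None result to has_bias=False is an assumption drawn from the constructor parameter, not something this page states.

```python
def concat_qkv_bias(bq, bk, bv, has_bias=True):
    """Join the q, k, and v bias vectors end to end, or return None without bias."""
    if not has_bias:
        return None
    return [*bq, *bk, *bv]


bias = concat_qkv_bias([0.1, 0.2], [0.3], [0.4])
print(bias)  # [0.1, 0.2, 0.3, 0.4]
print(concat_qkv_bias([], [], [], has_bias=False))  # None
```

The ordering q, k, v matches the ordering of the fused weight, so each bias element lines up with its output row.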