Python class

GGUFQAttentionWithRope

`GGUFQAttentionWithRope`

class max.nn.attention.GGUFQAttentionWithRope(*, rope, num_attention_heads, num_key_value_heads, hidden_size, kv_params, dtype, quantization_encoding, devices=None, linear_cls=<class 'max.nn.linear.Linear'>, scale=None, has_bias=False, clip_qkv=None)

Bases: AttentionWithRope

Implementation of attention with GGUF quantized weights.

Initializes the GGUF attention layer.

Parameters:

rope (RotaryEmbedding) – The rope layer to borrow the freqs_cis value from.
num_attention_heads (int) – The number of attention heads.
num_key_value_heads (int) – The number of key/value heads.
hidden_size (int) – The dimension of the hidden states.
kv_params (KVCacheParams) – KV Cache params, including number of kv heads, head dim, and dtype.
layer_idx – The layer number associated with this Attention block.
dtype (DType) – Data type of the weights; should always be uint8.
devices (list[DeviceRef] | None) – Device(s) on which to place the weights and run the computation. If multiple are provided, the first device is used. Use TensorParallelAttentionWithRope to use all devices during attention computation.
quantization_encoding (QuantizationEncoding) – Quantization encoding of the weights.
linear_cls (Callable[..., Linear]) – Linear class to use for the output dense layer.
scale (float | None) – Value used to scale the results of the attention output.
has_bias (bool) – Whether to use an attention bias.
clip_qkv (float | None) – If provided, the QKV weights are clamped to the range [-clip_qkv, clip_qkv].
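
The clamping described for clip_qkv can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: the helper name `clamp_qkv` is hypothetical, and it operates on a flat list of floats rather than on a MAX tensor. It only shows the element-wise behavior implied by the range [-clip_qkv, clip_qkv], and the pass-through behavior when clip_qkv is None.

```python
from typing import Iterable, List, Optional


def clamp_qkv(values: Iterable[float], clip_qkv: Optional[float] = None) -> List[float]:
    """Hypothetical helper: clamp each value to [-clip_qkv, clip_qkv].

    When clip_qkv is None, the values are returned unchanged, mirroring
    the "If provided" wording of the parameter description.
    """
    if clip_qkv is None:
        return list(values)
    # Element-wise clamp: values below -clip_qkv or above clip_qkv are
    # replaced by the nearest bound.
    return [max(-clip_qkv, min(clip_qkv, v)) for v in values]


clipped = clamp_qkv([-3.0, 0.5, 2.0], clip_qkv=1.0)
unclipped = clamp_qkv([-3.0, 0.5, 2.0])
```

With clip_qkv=1.0, values outside the range move to the nearest bound and values inside it are left alone.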

`rope`

rope: RotaryEmbedding

`wqkv`

property wqkv: TensorValue

The concatenation of q, k, and v weight vectors.
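
As a rough illustration of what concatenating the q, k, and v weights means for sizes, the sketch below computes logical (unquantized) row counts from the head configuration. The helper `qkv_row_counts` is hypothetical, and the layout it assumes (q rows first, then k, then v, each sized as heads times head dimension) is an assumption rather than something this page states. GGUF-quantized weights are stored as packed uint8 data, so the actual stored tensor shape will differ from these logical counts.

```python
from typing import Tuple


def qkv_row_counts(
    num_attention_heads: int,
    num_key_value_heads: int,
    head_dim: int,
) -> Tuple[int, int, int, int]:
    """Hypothetical helper: logical row counts of q, k, v and their concatenation.

    Assumes each projection has (number of heads * head_dim) output rows,
    which is the usual convention for grouped-query attention.
    """
    q_rows = num_attention_heads * head_dim
    kv_rows = num_key_value_heads * head_dim
    total_rows = q_rows + 2 * kv_rows
    return q_rows, kv_rows, kv_rows, total_rows


# Example configuration (illustrative values, not taken from this page).
q_rows, k_rows, v_rows, total_rows = qkv_row_counts(
    num_attention_heads=32,
    num_key_value_heads=8,
    head_dim=128,
)
```

When num_key_value_heads is smaller than num_attention_heads, the k and v blocks are proportionally smaller than the q block, so the concatenated weight is not simply three times the q size.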

`wqkv_bias`

property wqkv_bias: TensorValue | None

The concatenation of q, k, and v bias weight vectors.
