
Python module

attention_with_rope

An opaque, KV-cache-optimized attention mechanism with RoPE.

AttentionWithRope

class max.nn.attention.attention_with_rope.AttentionWithRope(*, rope: ~max.nn.rotary_embedding.OptimizedRotaryEmbedding, num_attention_heads: int, num_key_value_heads: int, hidden_size: int, kv_params: ~max.nn.kv_cache.cache_params.KVCacheParams, layer_idx: int, devices: list[max.graph.type.DeviceRef] | None = None, dtype: ~max._core.dtype.DType = DType.float32, linear_cls: ~typing.Callable[[...], ~max.nn.linear.Linear] = <class 'max.nn.linear.Linear'>, stacked_qkv: bool = False, scale: float | None = None, has_bias: bool = False, float8_config: ~max.nn.linear.Float8Config | None = None, clip_qkv: float | None = None)

Implementation of attention that uses the RoPE frequencies.

Initializes the attention layer.

  • Parameters:

    • rope – The rope layer to borrow the freq_cis value from.
    • num_attention_heads – The number of attention heads.
    • num_key_value_heads – Number of key/value heads.
    • hidden_size – The dimension of the hidden states.
    • kv_params – KV Cache Params, including the number of kv heads, the head dim, and data type.
    • layer_idx – The layer number associated with this Attention block.
    • dtype – DType of the weights.
    • devices – Devices on which to place the weights and run the computation. If multiple are provided, only the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
    • linear_cls – Linear class to use for the output dense layer.
    • stacked_qkv – Whether the weights are stacked together.
    • scale – Value used to scale the results of the attention output.
    • has_bias – Whether to use an attention bias.
    • clip_qkv – If provided, the QKV weights are clamped to the range [-clip_qkv, clip_qkv].
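
A minimal construction sketch, not a definitive recipe: rope and kv_params are assumed to be an OptimizedRotaryEmbedding and a KVCacheParams built elsewhere for the model, the head counts and hidden size are illustrative, and the public import paths for DType and DeviceRef are assumptions.

    from max.dtype import DType
    from max.graph import DeviceRef
    from max.nn.attention.attention_with_rope import AttentionWithRope

    def build_attention(rope, kv_params):
        # rope: OptimizedRotaryEmbedding, kv_params: KVCacheParams, both built elsewhere.
        return AttentionWithRope(
            rope=rope,
            num_attention_heads=32,      # illustrative model dimensions
            num_key_value_heads=8,       # fewer KV heads than attention heads (GQA)
            hidden_size=4096,
            kv_params=kv_params,
            layer_idx=0,
            devices=[DeviceRef.GPU(0)],  # only the first device is used by this class
            dtype=DType.bfloat16,
            has_bias=False,
            clip_qkv=None,               # no clamping of the QKV weights
        )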

build_subgraph()

build_subgraph(name: str, x_type: TensorType, kv_collection_type: _OpaqueType) → Module

max_input_scale

property max_input_scale: TensorValue | None

The maximum of the q, k, and v input scale vectors.

max_weight_scale

property max_weight_scale: TensorValue | None

The maximum of the q, k, and v weight scale vectors.

rope

rope: OptimizedRotaryEmbedding

wqkv

property wqkv: TensorValue

The concatenation of q, k, and v weight vectors.

wqkv_bias

property wqkv_bias: TensorValue | None

The concatenation of q, k, and v bias weight vectors.

AttentionWithRopeQKV

class max.nn.attention.attention_with_rope.AttentionWithRopeQKV(n_heads: 'int', kv_params: 'KVCacheParams', layer_idx: 'int', wq: 'TensorValueLike', wk: 'TensorValueLike', wv: 'TensorValueLike', wo: 'LinearV1', scale: 'float', rope: 'OptimizedRotaryEmbedding')

rope

rope: OptimizedRotaryEmbedding

AttentionWithRopeV1

class max.nn.attention.attention_with_rope.AttentionWithRopeV1(n_heads: int, kv_params: KVCacheParams, layer_idx: TensorValue, wqkv: TensorValue, wo: LinearV1, scale: float, rope: OptimizedRotaryEmbedding, bias: TensorValue | None = None, perm_idx: TensorValue | None = None, quantization_config: QuantizationConfig | None = None)

Implementation of attention that uses the RoPE frequencies.

Deprecated: Use AttentionWithRope instead.

bias

bias: TensorValue | None = None

perm_idx

perm_idx: TensorValue | None = None

quantization_config

quantization_config: QuantizationConfig | None = None

rope

rope: OptimizedRotaryEmbedding

DistributedAttentionWithRope

class max.nn.attention.attention_with_rope.DistributedAttentionWithRope(**kwargs)

Initializes the attention layer.

  • Parameters:

    • rope – The rope layer to borrow the freq_cis value from.
    • num_attention_heads – The number of attention heads.
    • num_key_value_heads – Number of key/value heads.
    • hidden_size – The dimension of the hidden states.
    • kv_params – KV Cache Params, including the number of kv heads, the head dim, and data type.
    • layer_idx – The layer number associated with this Attention block.
    • dtype – DType of the weights.
    • devices – Devices on which to place the weights and run the computation. All provided devices are used during the distributed attention computation.
    • linear_cls – Linear class to use for the output dense layer.
    • stacked_qkv – Whether the weights are stacked together.
    • scale – Value used to scale the results of the attention output.
    • has_bias – Whether to use an attention bias.
    • clip_qkv – If provided, the QKV weights are clamped to the range [-clip_qkv, clip_qkv].
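
The keyword arguments mirror those of AttentionWithRope (the class takes **kwargs), so a construction sketch differs mainly in the devices list. This is a hedged sketch: the helper name is hypothetical, rope and kv_params are assumed to be built elsewhere, and the import paths for DType and DeviceRef are assumptions.

    from max.dtype import DType
    from max.graph import DeviceRef
    from max.nn.attention.attention_with_rope import DistributedAttentionWithRope

    def build_distributed_attention(rope, kv_params, num_gpus=2):
        # Listing several devices is what lets this variant spread the attention
        # computation across GPUs instead of using only the first device.
        return DistributedAttentionWithRope(
            rope=rope,
            num_attention_heads=32,
            num_key_value_heads=8,
            hidden_size=4096,
            kv_params=kv_params,
            layer_idx=0,
            devices=[DeviceRef.GPU(i) for i in range(num_gpus)],
            dtype=DType.bfloat16,
        )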

build_subgraph()

build_subgraph(name, x_type: list[max.graph.type.TensorType], kv_collection_type: list[max.graph.type._OpaqueType]) → Module

GGUFQAttentionWithRope

class max.nn.attention.attention_with_rope.GGUFQAttentionWithRope(*, rope: ~max.nn.rotary_embedding.OptimizedRotaryEmbedding, num_attention_heads: int, num_key_value_heads: int, hidden_size: int, kv_params: ~max.nn.kv_cache.cache_params.KVCacheParams, layer_idx: int, dtype: ~max._core.dtype.DType, quantization_encoding: ~max.graph.quantization.QuantizationEncoding, devices: list[max.graph.type.DeviceRef] | None = None, linear_cls: ~typing.Callable[[...], ~max.nn.linear.Linear] = <class 'max.nn.linear.Linear'>, scale: float | None = None, has_bias: bool = False, clip_qkv: float | None = None)

Implementation of attention with GGUF quantized weights.

Initializes the attention layer.

  • Parameters:

    • rope – The rope layer to borrow the freq_cis value from.
    • num_attention_heads – The number of attention heads.
    • num_key_value_heads – Number of key/value heads.
    • hidden_size – The dimension of the hidden states.
    • kv_params – KV Cache Params, including the number of kv heads, the head dim, and data type.
    • layer_idx – The layer number associated with this Attention block.
    • dtype – DType of the weights; should always be uint8.
    • devices – Devices on which to place the weights and run the computation. If multiple are provided, only the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
    • quantization_encoding – Quantization encoding of the weights.
    • linear_cls – Linear class to use for the output dense layer.
    • scale – Value used to scale the results of the attention output.
    • has_bias – Whether to use an attention bias.
    • clip_qkv – If provided, the QKV weights are clamped to the range [-clip_qkv, clip_qkv].
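
A hedged construction sketch: the Q4_K encoding is only an illustrative choice (use whatever encoding the GGUF checkpoint actually has), rope and kv_params are assumed to be built elsewhere, and the helper name is hypothetical.

    from max.dtype import DType
    from max.graph.quantization import QuantizationEncoding
    from max.nn.attention.attention_with_rope import GGUFQAttentionWithRope

    def build_gguf_attention(rope, kv_params):
        return GGUFQAttentionWithRope(
            rope=rope,
            num_attention_heads=32,
            num_key_value_heads=8,
            hidden_size=4096,
            kv_params=kv_params,
            layer_idx=0,
            dtype=DType.uint8,  # per the parameter docs, GGUF weights are stored as uint8
            quantization_encoding=QuantizationEncoding.Q4_K,  # illustrative encoding
        )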

rope

rope: OptimizedRotaryEmbedding

wqkv

property wqkv: TensorValue

The concatenation of q, k, and v weight vectors.

wqkv_bias

property wqkv_bias: TensorValue | None

The concatenation of q, k, and v bias weight vectors.

GPTQAttentionWithRope

class max.nn.attention.attention_with_rope.GPTQAttentionWithRope(quantization_config: ~max.graph.quantization.QuantizationConfig, rope: ~max.nn.rotary_embedding.OptimizedRotaryEmbedding, num_attention_heads: int, num_key_value_heads: int, hidden_size: int, kv_params: ~max.nn.kv_cache.cache_params.KVCacheParams, layer_idx: int, devices: list[max.graph.type.DeviceRef] | None = None, dtype: ~max._core.dtype.DType = DType.float32, scale: float | None = None, linear_cls: ~typing.Callable[[...], ~max.nn.linear.Linear] = <class 'max.nn.linear.Linear'>)

Implementation of the GPTQ attention layer.

Initializes the attention layer.

  • Parameters:

    • quantization_config – The GPTQ quantization configuration for the weights.
    • rope – The rope layer to borrow the freq_cis value from.
    • num_attention_heads – The number of attention heads.
    • num_key_value_heads – Number of key/value heads.
    • hidden_size – The dimension of the hidden states.
    • kv_params – KV Cache Params, including the number of kv heads, the head dim, and data type.
    • layer_idx – The layer number associated with this Attention block.
    • dtype – DType of the weights.
    • devices – Devices on which to place the weights and run the computation. If multiple are provided, only the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
    • linear_cls – Linear class to use for the output dense layer.
    • scale – Value used to scale the results of the attention output.
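
A minimal sketch of wiring this layer up; the quantization_config is assumed to be a QuantizationConfig parsed from the GPTQ checkpoint's metadata elsewhere, the helper name is hypothetical, and the remaining values are illustrative.

    from max.nn.attention.attention_with_rope import GPTQAttentionWithRope

    def build_gptq_attention(quantization_config, rope, kv_params):
        # quantization_config: max.graph.quantization.QuantizationConfig, built elsewhere.
        return GPTQAttentionWithRope(
            quantization_config=quantization_config,
            rope=rope,
            num_attention_heads=32,
            num_key_value_heads=8,
            hidden_size=4096,
            kv_params=kv_params,
            layer_idx=0,
        )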

wqkv

property wqkv: TensorValue

The concatenation of q, k, and v weight vectors.

LatentAttentionWithRope

class max.nn.attention.attention_with_rope.LatentAttentionWithRope(*, rope: ~max.nn.rotary_embedding.OptimizedRotaryEmbedding, num_attention_heads: int, num_key_value_heads: int, hidden_size: int, kv_params: ~max.nn.kv_cache.cache_params.KVCacheParams, layer_idx: int, dtype: ~max._core.dtype.DType, devices: list[max.graph.type.DeviceRef] | None = None, linear_cls: ~typing.Callable[[...], ~max.nn.linear.Linear] = <class 'max.nn.linear.Linear'>, scale: float | None = None, has_bias: bool = False, clip_qkv: float | None = None, q_lora_rank: int | None = None, kv_lora_rank: int = 512, qk_nope_head_dim: int = 128, qk_rope_head_dim: int = 64, v_head_dim: int = 128, buffer_size: int = 16384)

Implementation of Latent Attention with RoPE.

Initializes the attention layer.

  • Parameters:

    • rope – The rope layer to borrow the freq_cis value from.
    • num_attention_heads – The number of attention heads.
    • num_key_value_heads – Number of key/value heads.
    • hidden_size – The dimension of the hidden states.
    • kv_params – KV Cache Params, including the number of kv heads, the head dim, and data type.
    • layer_idx – The layer number associated with this Attention block.
    • dtype – DType of the weights.
    • devices – Devices on which to place the weights and run the computation. If multiple are provided, only the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
    • linear_cls – Linear class to use for the output dense layer.
    • scale – Value used to scale the results of the attention output.
    • has_bias – Whether to use an attention bias.
    • clip_qkv – If provided, the QKV weights are clamped to the range [-clip_qkv, clip_qkv].
    • buffer_size – Buffer size for storing temporary results during prefill, in units of tokens.
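
A construction sketch that keeps the documented defaults for the latent dimensions; the head counts, hidden size, and q_lora_rank value are illustrative only, rope and kv_params are assumed to be built elsewhere, and the helper name is hypothetical.

    from max.dtype import DType
    from max.nn.attention.attention_with_rope import LatentAttentionWithRope

    def build_latent_attention(rope, kv_params):
        return LatentAttentionWithRope(
            rope=rope,
            num_attention_heads=128,
            num_key_value_heads=128,
            hidden_size=7168,
            kv_params=kv_params,
            layer_idx=0,
            dtype=DType.bfloat16,
            q_lora_rank=1536,       # illustrative low-rank query size (default is None)
            kv_lora_rank=512,       # documented default
            qk_nope_head_dim=128,   # documented default
            qk_rope_head_dim=64,    # documented default
            v_head_dim=128,         # documented default
            buffer_size=16384,      # documented default, in tokens
        )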

rope

rope: OptimizedRotaryEmbedding

w_uk_uv

property w_uk_uv: list[max.graph.value.TensorValue]

The w_uk and w_uv up-projection weights, as a list of tensor values.

wqkv

property wqkv: TensorValue

The concatenation of q, k, and v weight vectors.

wqkv_bias

property wqkv_bias: TensorValue | None

The concatenation of q, k, and v bias weight vectors.

distribute_value()

max.nn.attention.attention_with_rope.distribute_value(v: TensorValue, devices: list[max.graph.type.DeviceRef]) → list[max.graph.value.TensorValue]
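
A hedged usage sketch, assuming distribute_value places a copy of the given tensor value on each listed device and returns one TensorValue per device; the helper below is hypothetical and the DeviceRef import path is an assumption.

    from max.graph import DeviceRef
    from max.nn.attention.attention_with_rope import distribute_value

    def replicate_to_gpus(value, num_gpus):
        # value: a TensorValue built elsewhere; returns a list with one copy per GPU.
        devices = [DeviceRef.GPU(i) for i in range(num_gpus)]
        return distribute_value(value, devices)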