Python class
AttentionDispatchResolver
class max.nn.kv_cache.AttentionDispatchResolver(session, device, is_mla, n_kv_heads_per_device, num_q_heads_per_device=None, is_fp8_kv=False)
Bases: object
Resolves packed attention decode metadata via kernel custom ops.
Supports both the MHA (mo.mha.decode.get_num_partitions) and MLA (mo.mla.compute_dispatch_args.scalar) decode kernels. The mode is selected automatically from kv_params.is_mla.
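The automatic kernel selection can be illustrated with a small sketch. This is not the real max.nn.kv_cache implementation; only the two kernel names come from the docstring above, and the ToyDispatchResolver class, its constructor, and the decode_kernel property are hypothetical stand-ins for the flag-based dispatch the docstring describes.

```python
# Kernel op names taken from the docstring above.
MHA_DECODE_KERNEL = "mo.mha.decode.get_num_partitions"
MLA_DECODE_KERNEL = "mo.mla.compute_dispatch_args.scalar"


class ToyDispatchResolver:
    """Hypothetical sketch: picks the decode kernel from an MLA flag,
    mirroring how the real resolver keys off kv_params.is_mla."""

    def __init__(self, is_mla: bool):
        self.is_mla = is_mla

    @property
    def decode_kernel(self) -> str:
        # Mode is selected automatically: MLA caches use the MLA
        # dispatch op, everything else falls back to the MHA op.
        return MLA_DECODE_KERNEL if self.is_mla else MHA_DECODE_KERNEL


# Usage: the caller never names a kernel explicitly.
resolver = ToyDispatchResolver(is_mla=False)
print(resolver.decode_kernel)
```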