AttentionDispatchResolver
class max.nn.kv_cache.AttentionDispatchResolver(devices, is_mla, n_kv_heads_per_device, num_q_heads_per_device=None, is_fp8_kv=False)
Bases: object
Resolves packed attention decode metadata via kernel custom ops.
Supports both MHA (mo.mha.decode.get_num_partitions) and MLA
(mo.mla.compute_dispatch_args.scalar) decode kernels. The mode
is selected automatically from the is_mla constructor argument (typically kv_params.is_mla).
Parameters:
- devices – The devices the KV cache is sharded across.
- is_mla – Whether to use the MLA decode kernel path (see above).
- n_kv_heads_per_device – Number of KV heads per device.
- num_q_heads_per_device – Number of query heads per device. Defaults to None.
- is_fp8_kv – Whether the KV cache is stored in FP8. Defaults to False.
resolve_for_replica()
resolve_for_replica(batch_size, max_prompt_length, max_cache_valid_length)
Returns one dispatch-metadata buffer per shard in a replica.
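To illustrate the "one buffer per shard" contract, here is a minimal sketch. Note this is a hypothetical stand-in written in plain Python, not the real class: the actual resolver fills its metadata via kernel custom ops (mo.mha.decode.get_num_partitions for MHA, mo.mla.compute_dispatch_args.scalar for MLA), and the SketchResolver and DispatchMetadata names below are illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical, simplified metadata record; the real resolver produces
# device-resident buffers computed by kernel custom ops.
@dataclass
class DispatchMetadata:
    batch_size: int
    max_prompt_length: int
    max_cache_valid_length: int

class SketchResolver:
    """Illustrative stand-in for AttentionDispatchResolver (not the real API)."""

    def __init__(self, devices, is_mla):
        self.devices = devices
        self.is_mla = is_mla  # selects the MHA vs. MLA decode kernel path

    def resolve_for_replica(self, batch_size, max_prompt_length, max_cache_valid_length):
        # Contract shown in the docs: one dispatch-metadata buffer per
        # shard (device) in the replica.
        return [
            DispatchMetadata(batch_size, max_prompt_length, max_cache_valid_length)
            for _ in self.devices
        ]

resolver = SketchResolver(devices=["gpu:0", "gpu:1"], is_mla=False)
buffers = resolver.resolve_for_replica(
    batch_size=4, max_prompt_length=128, max_cache_valid_length=512
)
print(len(buffers))  # → 2, one buffer per device
```

The key point is the fan-out: callers get a list indexed by shard, so each device's decode kernel can be launched with its own metadata.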