Python class
AttentionDispatchResolver
class max.nn.kv_cache.AttentionDispatchResolver(devices, is_mla, n_kv_heads_per_device, num_q_heads_per_device=None, is_fp8_kv=False)
Bases: object
Resolves packed attention decode metadata via kernel custom ops.
Supports both MHA (mo.mha.decode.get_num_partitions) and MLA
(mo.mla.compute_dispatch_args.scalar) decode kernels. The is_mla flag
selects which of the two is used.
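As a minimal sketch of the selection described above, the helper below maps the is_mla flag to one of the two op names quoted in the description. The function name select_decode_op and the two constants are hypothetical and are not part of max.nn.kv_cache. Only the op name strings come from this page.

```python
# Op names are quoted from the class description; everything else here is
# a hypothetical illustration, not the library's implementation.
MHA_DECODE_OP = "mo.mha.decode.get_num_partitions"
MLA_DECODE_OP = "mo.mla.compute_dispatch_args.scalar"


def select_decode_op(is_mla: bool) -> str:
    """Return the name of the custom op that matches the is_mla flag."""
    return MLA_DECODE_OP if is_mla else MHA_DECODE_OP


if __name__ == "__main__":
    print(select_decode_op(is_mla=False))
    print(select_decode_op(is_mla=True))
```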
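Parameters:
- devices
- is_mla
- n_kv_heads_per_device
- num_q_heads_per_device (default: None)
- is_fp8_kv (default: False)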
probe_lengths()
probe_lengths(max_cache_length, q_max_seq_len=1)
Returns cache lengths to probe for distinct num_partitions.
These are the cache lengths warmed up during graph capture. MHA probes
at 256-token granularity; MLA probes at a finer 64-token granularity
(and, under speculative decoding, adds extra probes to hit more
(num_partitions, draft_num_partitions) pairs). The selected
granularity follows is_mla.
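To make the two granularities concrete, here is a rough sketch that lists one probe per step up to max_cache_length. It assumes the probes are plain multiples of the step plus the maximum itself. The real method may deduplicate lengths that map to the same num_partitions, and this sketch ignores q_max_seq_len and the extra speculative decoding probes. The name sketch_probe_lengths is hypothetical.

```python
def sketch_probe_lengths(max_cache_length: int, is_mla: bool) -> list[int]:
    """Illustrative only: list cache lengths at the documented granularity."""
    if max_cache_length <= 0:
        return []
    # 64-token steps for MLA, 256-token steps for MHA (from the text above).
    step = 64 if is_mla else 256
    probes = list(range(step, max_cache_length + 1, step))
    # Assumption: always include the maximum so the largest case is covered.
    if not probes or probes[-1] != max_cache_length:
        probes.append(max_cache_length)
    return probes


if __name__ == "__main__":
    print(sketch_probe_lengths(1024, is_mla=False))
    print(sketch_probe_lengths(256, is_mla=True))
```

Under these assumptions, MLA yields four times as many probe points as MHA over the same range, which is the cost of the finer granularity during graph capture.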
resolve_attn_key()
resolve_attn_key(batch_size, max_prompt_length, max_cache_valid_length)
Returns the resolved decode dispatch key for the given shape.
Empty / degenerate replicas (batch_size <= 0 or a CPU-only
resolver) return a sentinel key (num_partitions=1) without invoking
the dispatch kernels.
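The early-return behavior for degenerate replicas can be sketched as follows. SketchKey, sketch_resolve, has_gpu, and run_kernel are all hypothetical stand-ins: the real key type and the way the resolver detects a CPU-only configuration are not described on this page. Only the num_partitions=1 sentinel and the batch_size <= 0 condition come from the text above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SketchKey:
    """Hypothetical stand-in for the resolved decode dispatch key."""
    num_partitions: int


SENTINEL_KEY = SketchKey(num_partitions=1)


def sketch_resolve(
    batch_size: int,
    max_prompt_length: int,
    max_cache_valid_length: int,
    *,
    has_gpu: bool,
    run_kernel: Callable[[int, int, int], int],
) -> SketchKey:
    """Return the sentinel for degenerate inputs, else ask the kernel."""
    # Empty replica or CPU-only resolver: skip the dispatch kernel entirely.
    if batch_size <= 0 or not has_gpu:
        return SENTINEL_KEY
    partitions = run_kernel(batch_size, max_prompt_length, max_cache_valid_length)
    return SketchKey(num_partitions=partitions)


if __name__ == "__main__":
    fake_kernel = lambda b, p, c: 4  # placeholder, not a real kernel
    print(sketch_resolve(0, 1, 512, has_gpu=True, run_kernel=fake_kernel))
    print(sketch_resolve(8, 1, 512, has_gpu=True, run_kernel=fake_kernel))
```

Passing the kernel in as a callable keeps the sketch testable without a device, and makes it easy to check that the degenerate path never calls it.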