Mojo function
balanced_target_q_per_cta
def balanced_target_q_per_cta(total_q: Int, topk: Int, blk_kv: Int, head_kv: Int, num_sms: Int, bm: Int = Int(128)) -> Int
Computes a load-balanced cap on queries per CTA for the scheduler's q-chunking.
Targets roughly num_sms * 2 work items, so that each CTA processes
ceil(q_count / bm) Q-groups against one resident KV block (instead of
assigning one CTA per bm queries).
The result is rounded up to a multiple of bm (the forward pass loops in
bm-query groups), floored at bm, and capped at topk * blk_kv * 2 so that a
few very large rows cannot starve the SMs.
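The rounding, floor, and cap described above can be illustrated with a small sketch. It is written in plain Python rather than Mojo and is not the library's implementation. The helper names `round_up_to_multiple`, `shape_q_per_cta`, and `q_groups` are hypothetical, and so is the `raw_target` input: this page does not say how the initial target is derived from total_q, head_kv, and num_sms, so the sketch takes it as given. Applying the three steps in the order the description lists them is also an assumption.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple` (integer ceil)."""
    return -(-value // multiple) * multiple


def shape_q_per_cta(raw_target: int, topk: int, blk_kv: int, bm: int = 128) -> int:
    """Apply the documented post-processing to a hypothetical raw target.

    raw_target is an assumed input; the real function computes its own
    target from total_q, head_kv, and num_sms in a way not shown here.
    """
    rounded = round_up_to_multiple(raw_target, bm)  # multiple of bm
    floored = max(rounded, bm)                      # never below one bm group
    return min(floored, topk * blk_kv * 2)          # cap at topk * blk_kv * 2


def q_groups(q_count: int, bm: int = 128) -> int:
    """Number of bm-query groups a CTA loops over: ceil(q_count / bm)."""
    return -(-q_count // bm)


# A raw target of 300 with bm = 128 rounds up to 384, i.e. 3 Q-groups.
cap = shape_q_per_cta(300, topk=16, blk_kv=64)
groups = q_groups(cap)
```

With topk = 8 and blk_kv = 64 the cap is 1024, so a raw target of 5000 would be clamped to 1024 in this sketch, while a raw target of 0 would be raised to the floor of 128.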
Returns:
Int