Mojo module
k2q_csr
Host reverse-CSR builder for KV-block-major sparse MHA, SM100.
Inverts the query-major selection q2k_indices [head_kv, total_q, topK] (for each
query token, the batch-local KV-block ids it attends to; an id < 0 marks an unused
slot) into the KV-block-major CSR that the block-major forward/combine kernels
consume: for each (batch, kv-block) pair, the list of queries that selected it.
The build is sequential and runs on the CPU; k2q_csr_device is the GPU port, and
this host builder serves as the oracle for it.
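The inversion described above can be illustrated with a small Python sketch for a single KV head. This is not the module's implementation: the function name invert_q2k and the cu_seqlens_q input (cumulative query lengths used to map a global token index to its batch) are assumptions introduced here for illustration only.

```python
def invert_q2k(q2k, cu_seqlens_q):
    """Invert a query-major selection into per-(batch, kv_block) query lists.

    q2k:          [total_q][topK] batch-local KV-block ids for one KV head,
                  with ids < 0 marking unused slots.
    cu_seqlens_q: hypothetical cumulative query lengths, one entry per batch
                  boundary (len = num_batches + 1).
    Returns a dict mapping (batch, kv_block) -> list of batch-local q indices.
    """
    rows = {}
    for batch in range(len(cu_seqlens_q) - 1):
        start, end = cu_seqlens_q[batch], cu_seqlens_q[batch + 1]
        for q_global in range(start, end):
            q_local = q_global - start  # batch-local query index
            for kv_block in q2k[q_global]:
                if kv_block < 0:
                    continue  # unused top-K slot
                rows.setdefault((batch, kv_block), []).append(q_local)
    return rows


# Two batches: batch 0 has queries 0..1, batch 1 has a single query.
example = invert_q2k([[0, 1], [1, -1], [0, -1]], [0, 2, 3])
```

Because queries are visited in order, each row's query list comes out sorted by batch-local index, which is one natural ordering for a sequential build; the actual ordering used by the module is not stated on this page.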
A CSR "row" is one (batch, kv_block) pair. Rows are numbered LEVEL-MAJOR round-
robin: all batches' block-0 first (batch order, skipping batches with no
block-0), then all block-1, etc., so scheduler_metadata's
(row_linear, batch, kv_block) stays consistent with the device builder.
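As a rough illustration of the level-major round-robin numbering, the following Python sketch enumerates rows from a per-batch KV-block count. The function name level_major_rows and the num_kv_blocks input are hypothetical, and the sketch assumes a batch "has" block k exactly when k is below its block count.

```python
def level_major_rows(num_kv_blocks):
    """Return the (batch, kv_block) pair for each row, in row_linear order.

    num_kv_blocks: hypothetical list with the number of KV blocks per batch.
    Rows are emitted level by level: every batch's block 0 first (in batch
    order), then every batch's block 1, and so on. Batches that do not have
    the current block are skipped.
    """
    rows = []
    max_level = max(num_kv_blocks, default=0)
    for kv_block in range(max_level):
        for batch, n_blocks in enumerate(num_kv_blocks):
            if kv_block < n_blocks:
                rows.append((batch, kv_block))
    return rows


# Batch 1 has no blocks at all, so it never contributes a row.
order = level_major_rows([2, 0, 3])
```

The position of a pair in the returned list plays the role of row_linear in this sketch.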
Contract tensors emitted (all i32, batch-local q indices):
- k2q_row_ptr [head_kv, total_rows + 1]: exclusive prefix of per-row counts.
- scheduler_metadata [work_capacity, 6]: (head_kv, row_linear, q_begin, q_count, batch, kv_block), plus work_count [1]. Each non-empty row is split into ceil(row_count / q_per_cta) work items (q-chunking) so a row selected by more than q_per_cta queries is served by multiple CTAs.
- split_counts [B, max_seqlen_q, head_kv]: per-query valid-block count.
- qsplit_indices [head_kv, total_q * topK]: q | (split_slot << 24), topK <= 255.
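The three pieces of arithmetic in this contract (the exclusive prefix, q-chunking, and the packed qsplit entry) can be sketched in Python as follows. All function names are hypothetical, and two details are assumptions rather than facts from this page: that q_begin is an offset within the row, and that split_slot occupies the 8 bits starting at bit 24.

```python
def exclusive_prefix(counts):
    """Row pointer: ptr[i] is the sum of counts[:i], with len(counts) + 1 entries."""
    ptr = [0]
    for c in counts:
        ptr.append(ptr[-1] + c)
    return ptr


def chunk_row(head_kv, row_linear, batch, kv_block, row_count, q_per_cta):
    """Split one row into ceil(row_count / q_per_cta) work items.

    Each item follows the field order
    (head_kv, row_linear, q_begin, q_count, batch, kv_block).
    Assumption: q_begin is relative to the start of the row.
    """
    items = []
    for q_begin in range(0, row_count, q_per_cta):
        q_count = min(q_per_cta, row_count - q_begin)
        items.append((head_kv, row_linear, q_begin, q_count, batch, kv_block))
    return items


def pack_qsplit(q, split_slot):
    """Pack as q | (split_slot << 24); assumes q fits in the low 24 bits."""
    assert 0 <= q < (1 << 24) and 0 <= split_slot <= 255
    return q | (split_slot << 24)


def unpack_qsplit(packed):
    """Recover (q, split_slot) from a packed entry."""
    return packed & 0xFFFFFF, packed >> 24


row_ptr = exclusive_prefix([1, 2, 1])
work = chunk_row(0, 1, 0, 1, row_count=5, q_per_cta=2)
packed = pack_qsplit(5, 3)
```

In this sketch a row with 5 queries and q_per_cta = 2 yields three work items covering 2, 2, and 1 queries. An empty row produces no work items, matching the statement that only non-empty rows are split.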
Structs
- K2qCsr: Reverse-CSR + schedule for one sparse-MHA forward pass (host-built).
Functions
- balanced_target_q_per_cta: Load-balanced queries-per-CTA cap for the scheduler q-chunking.
- build_k2q_csr: Builds the reverse-CSR + schedule from the query-major selection.