
Mojo module

k2q_csr

Host reverse-CSR builder for KV-block-major sparse MHA, SM100.

Inverts the query-major selection q2k_indices [head_kv, total_q, topK] (for each query token, the batch-local KV-block ids it attends to; an id < 0 marks an unused slot) into the KV-block-major CSR that the block-major forward/combine kernels consume: for each (batch, kv-block) pair, the list of queries that selected it. The build is sequential on the CPU; k2q_csr_device is the GPU port, and this host build serves as the oracle for it.
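The inversion can be illustrated with a minimal plain-Python sketch for a single KV head. This is not the module's Mojo API: the function name invert_q2k and the cu_seqlens_q argument (cumulative query offsets per batch, length B + 1) are assumptions introduced here only to show how batch-local query indices could be derived.

```python
from collections import defaultdict


def invert_q2k(q2k_indices, cu_seqlens_q):
    """Invert a query-major selection into (batch, kv_block) -> queries.

    q2k_indices:  [total_q][topK] KV-block ids for ONE KV head; ids < 0 are unused.
    cu_seqlens_q: hypothetical cumulative query offsets, length B + 1.
    Returns a dict mapping (batch, kv_block) to a list of batch-local query indices.
    """
    k2q = defaultdict(list)
    for batch in range(len(cu_seqlens_q) - 1):
        q_start, q_end = cu_seqlens_q[batch], cu_seqlens_q[batch + 1]
        for q_global in range(q_start, q_end):
            for kv_block in q2k_indices[q_global]:
                if kv_block < 0:
                    continue  # unused top-K slot
                # Store the query index relative to its own batch.
                k2q[(batch, kv_block)].append(q_global - q_start)
    return dict(k2q)


# Two batches: batch 0 owns queries 0..1, batch 1 owns query 2.
selection = [[0, 1], [1, -1], [0, -1]]
print(invert_q2k(selection, [0, 2, 3]))
```

Because queries are visited in ascending order, each per-row list comes out sorted by query index; whether the real builder guarantees that ordering is not stated in the source.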

A CSR "row" is one (batch, kv_block) pair. Rows are numbered level-major, round-robin: block 0 of every batch first (in batch order, skipping batches with no block 0), then block 1 of every batch, and so on, so that scheduler_metadata's (row_linear, batch, kv_block) stays consistent with the device builder.
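The row numbering described above might be sketched as follows in plain Python. The function name level_major_rows and its input num_kv_blocks (the number of KV blocks in each batch) are hypothetical, and the sketch assumes that a row exists for every KV block a batch has and that batches lacking a given block level are skipped at that level.

```python
def level_major_rows(num_kv_blocks):
    """Return (batch, kv_block) pairs in level-major, round-robin order.

    num_kv_blocks[b] is the (hypothetical) number of KV blocks in batch b.
    The position of a pair in the returned list is its row_linear index.
    """
    rows = []
    max_blocks = max(num_kv_blocks, default=0)
    for kv_block in range(max_blocks):            # one "level" per block id
        for batch, n_blocks in enumerate(num_kv_blocks):
            if kv_block < n_blocks:               # skip batches without this block
                rows.append((batch, kv_block))
    return rows


# Batches with 2, 1 and 3 KV blocks respectively.
for row_linear, (batch, kv_block) in enumerate(level_major_rows([2, 1, 3])):
    print(row_linear, batch, kv_block)
```

For the input [2, 1, 3] the order is (0,0), (1,0), (2,0), (0,1), (2,1), (2,2): all block-0 rows precede any block-1 row.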

Contract tensors emitted (all i32, batch-local q indices):

  • k2q_row_ptr [head_kv, total_rows + 1]: exclusive prefix sum of per-row counts.
  • scheduler_metadata [work_capacity, 6]: one (head_kv, row_linear, q_begin, q_count, batch, kv_block) tuple per work item, plus work_count [1]. Each non-empty row is split into ceil(row_count / q_per_cta) work items (q-chunking), so a row selected by more than q_per_cta queries is served by multiple CTAs.
  • split_counts [B, max_seqlen_q, head_kv]: per-query valid-block count.
  • qsplit_indices [head_kv, total_q * topK]: each entry packs q | (split_slot << 24); requires topK <= 255.
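A rough plain-Python sketch of three of these pieces, for one KV head: the exclusive prefix, the q-chunking of rows into work items, and the q | (split_slot << 24) packing. All function names are hypothetical. The sketch also assumes that q_begin is an offset within the row (the source does not say whether it is row-relative or absolute) and that q fits in the low 24 bits.

```python
from itertools import accumulate


def build_row_ptr(row_counts):
    """Exclusive prefix of per-row counts; length is len(row_counts) + 1."""
    return [0] + list(accumulate(row_counts))


def chunk_work_items(head_kv, rows, row_counts, q_per_cta):
    """Split each non-empty row into ceil(count / q_per_cta) work items.

    rows[i] is the (batch, kv_block) pair of row_linear == i.
    Each item is (head_kv, row_linear, q_begin, q_count, batch, kv_block),
    with q_begin ASSUMED to be relative to the start of the row.
    """
    work = []
    for row_linear, ((batch, kv_block), count) in enumerate(zip(rows, row_counts)):
        for q_begin in range(0, count, q_per_cta):   # empty rows yield nothing
            q_count = min(q_per_cta, count - q_begin)
            work.append((head_kv, row_linear, q_begin, q_count, batch, kv_block))
    return work


def pack_qsplit(q, split_slot):
    """Pack a query index (low 24 bits, assumed) with its split slot (high bits)."""
    return q | (split_slot << 24)


def unpack_qsplit(packed):
    """Inverse of pack_qsplit for non-negative packed values."""
    return packed & 0xFFFFFF, packed >> 24


rows = [(0, 0), (1, 0), (0, 1)]
counts = [3, 0, 5]
print(build_row_ptr(counts))
print(chunk_work_items(0, rows, counts, q_per_cta=2))
print(unpack_qsplit(pack_qsplit(5, 3)))
```

With counts [3, 0, 5] and q_per_cta = 2, the rows produce 2, 0 and 3 work items, so work_count would be 5 in this sketch.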

Structs

  • K2qCsr: Reverse-CSR + schedule for one sparse-MHA forward pass (host-built).

Functions