
Mojo module

k2q_csr

Host reverse-CSR builder for KV-block-major sparse MHA, SM100.

Inverts the query-major selection q2k_indices [head_kv, total_q, topK] (for each query token, the batch-local KV-block ids it attends to; a value < 0 marks an unused slot) into the KV-block-major CSR that the block-major forward/combine kernels consume: for each (batch, kv-block) pair, the list of queries that selected it. The build runs sequentially on the CPU; k2q_csr_device is the GPU port, and this host builder serves as the oracle for it.
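The inversion described above can be sketched in a few lines of Python. This is an illustration for a single KV head, not the module's Mojo implementation: the function name invert_q2k and the cu_seqlens_q argument (cumulative query offsets per batch, used here to recover batch-local query indices) are assumptions that do not appear in the module's documented interface.

```python
from collections import defaultdict


def invert_q2k(q2k_one_head, cu_seqlens_q):
    """Invert a query-major selection into {(batch, kv_block): [batch-local q, ...]}.

    q2k_one_head: total_q rows, each holding topK KV-block ids (< 0 = unused).
    cu_seqlens_q: hypothetical cumulative query offsets, length B + 1.
    """
    k2q = defaultdict(list)
    for batch in range(len(cu_seqlens_q) - 1):
        q_start, q_end = cu_seqlens_q[batch], cu_seqlens_q[batch + 1]
        for q_global in range(q_start, q_end):
            for kv_block in q2k_one_head[q_global]:
                if kv_block < 0:  # negative ids mark unused slots
                    continue
                # Store the batch-local query index, as the contract requires.
                k2q[(batch, kv_block)].append(q_global - q_start)
    return dict(k2q)


# Two batches: batch 0 has two query tokens, batch 1 has one; topK = 2.
q2k = [
    [0, 1],   # batch 0, q 0 attends blocks 0 and 1
    [1, -1],  # batch 0, q 1 attends block 1 only
    [0, -1],  # batch 1, q 0 attends block 0 only
]
k2q = invert_q2k(q2k, cu_seqlens_q=[0, 2, 3])
```

Here block 1 of batch 0 ends up with two queries, while each block 0 keeps one; a real build would then flatten these lists into the CSR arrays per KV head.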

A CSR "row" is one (batch, kv_block) pair. Rows are numbered level-major, round-robin: every batch's block 0 comes first (in batch order, skipping batches that have no block 0), then every block 1, and so on. This keeps scheduler_metadata's (row_linear, batch, kv_block) consistent with the device builder.
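The row numbering can be illustrated with a short Python sketch. It assumes the number of KV blocks per batch is known up front; the name level_major_rows and the num_kv_blocks argument are hypothetical and only serve the example.

```python
def level_major_rows(num_kv_blocks):
    """Return the (batch, kv_block) pair for each linear row id, level-major.

    num_kv_blocks: hypothetical per-batch KV-block counts, length B.
    """
    rows = []
    for level in range(max(num_kv_blocks, default=0)):
        for batch, n_blocks in enumerate(num_kv_blocks):
            if level < n_blocks:  # skip batches that have no block at this level
                rows.append((batch, level))
    return rows


# Batch 0 has 3 blocks, batch 1 has none, batch 2 has 1.
rows = level_major_rows([3, 0, 1])
```

The index of each pair in rows plays the role of row_linear: block 0 of batches 0 and 2 take rows 0 and 1, and the remaining blocks of batch 0 follow.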

Contract tensors emitted (all i32, batch-local q indices):

k2q_row_ptr [head_kv, total_rows + 1]: exclusive prefix sum of the per-row counts.
scheduler_metadata [work_capacity, 6]: one (head_kv, row_linear, q_begin, q_count, batch, kv_block) tuple per work item, plus work_count [1]. Each non-empty row is split into ceil(row_count / q_per_cta) work items (q-chunking), so a row selected by more than q_per_cta queries is served by multiple CTAs.
split_counts [B, max_seqlen_q, head_kv]: per-query count of valid blocks.
qsplit_indices [head_kv, total_q * topK]: q | (split_slot << 24), with topK <= 255.
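Three of the conventions in this list (the exclusive prefix, q-chunking, and the packed qsplit_indices entry) are easy to check with a small Python sketch. The helper names are hypothetical, q_begin is assumed here to be relative to the start of the row, and the packing uses Python's unbounded integers, so it does not model the sign bit an i32 would set for slots of 128 or more.

```python
def exclusive_prefix(counts):
    """Row pointer array: ptr[i] is the sum of counts[:i], length len(counts) + 1."""
    ptr = [0]
    for c in counts:
        ptr.append(ptr[-1] + c)
    return ptr


def chunk_row(row_count, q_per_cta):
    """Split one row into ceil(row_count / q_per_cta) (q_begin, q_count) items."""
    return [
        (begin, min(q_per_cta, row_count - begin))
        for begin in range(0, row_count, q_per_cta)
    ]


def pack_qsplit(q, split_slot):
    """Pack q into the low 24 bits and split_slot into the bits above them."""
    assert 0 <= q < (1 << 24) and 0 <= split_slot <= 255
    return q | (split_slot << 24)


def unpack_qsplit(packed):
    """Recover (q, split_slot) from a non-negative packed value."""
    return packed & 0xFFFFFF, packed >> 24


row_ptr = exclusive_prefix([2, 0, 3])   # three rows with 2, 0 and 3 queries
work_items = chunk_row(5, q_per_cta=2)  # one row with 5 queries, 2 per CTA
packed = pack_qsplit(q=7, split_slot=3)
```

An empty row yields no work items, which matches the statement that only non-empty rows are split.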

Structs

K2qCsr: Reverse-CSR + schedule for one sparse-MHA forward pass (host-built).

Functions

balanced_target_q_per_cta: Load-balanced queries-per-CTA cap for the scheduler q-chunking.
build_k2q_csr: Builds the reverse-CSR + schedule from the query-major selection.
