
Mojo module

k2q_csr_device

Device (GPU) reverse-CSR builder for KV-block-major sparse MHA.

GPU port of the host builder k2q_csr.build_k2q_csr, which serves as its oracle. It inverts the query-major selection q2k [head_kv, total_q, topK] into the KV-block-major CSR and schedule that the block-major forward/combine kernels consume. It writes the same contract tensors as the host builder, but directly into device buffers (no host round-trip).
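For orientation, the inversion itself can be sketched on the host in plain Python. This is a minimal single-head sketch, not the module's API: the name `invert_q2k`, the flat row index, the output names `row_ptr` and `q_idx`, and the convention that a negative entry marks an unused topK slot are all assumptions made for illustration. The real contract tensors (qsplit, split_counts, scheduler_metadata) are not reproduced here.

```python
def invert_q2k(q2k, num_rows):
    """Invert a query-major selection into a row-major CSR (one head).

    q2k      : list over queries; q2k[q] lists the row ids query q selected.
               Negative ids are treated as padding (an assumption).
    num_rows : number of KV-block rows (hypothetical flat row index).
    Returns (row_ptr, q_idx) where q_idx[row_ptr[r]:row_ptr[r + 1]] holds
    the queries that selected row r, in ascending q order.
    """
    # Pass 1: count how many queries selected each row.
    row_counts = [0] * num_rows
    for rows in q2k:
        for r in rows:
            if r >= 0:
                row_counts[r] += 1

    # Exclusive prefix sum of the counts gives each row's start offset.
    row_ptr = [0] * (num_rows + 1)
    for r in range(num_rows):
        row_ptr[r + 1] = row_ptr[r] + row_counts[r]

    # Pass 2: sequential writer; visiting q in order keeps rows q-ascending.
    cursor = list(row_ptr[:-1])
    q_idx = [0] * row_ptr[-1]
    for q, rows in enumerate(q2k):
        for r in rows:
            if r >= 0:
                q_idx[cursor[r]] = q
                cursor[r] += 1
    return row_ptr, q_idx
```

With four queries and topK = 2, `invert_q2k([[0, 2], [1, 2], [0, 1], [2, -1]], 3)` groups the queries by the row they selected, which is the ordering the device builder is described as matching.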

Five stages:

row_map: round-robin (batch, kv_block) -> row_linear + row_coords
hist: per-(CTA,warp) unit histograms -> tile_counts + row_counts
row_prefix: one block per head: row_counts -> row_ptr, emit scheduler_metadata
tile_prefix: scan tile_counts along the (CTA,warp) unit axis -> per-unit base
scatter: per-unit q-sequential write of qsplit / split_counts

The hist/scatter grid is (g, head_kv): heads run as parallel CTAs (grid.y), and the q-range is tiled across g CTAs x kwarps warps (g_total units), each owning a contiguous q-sub-range. As a result, g*head_kv CTAs spread the q*topk edge stream across the SMs, whereas a single under-gridded CTA would serialize it on one SM. Per-row slots are reserved by an exclusive prefix scan over the units (the row_prefix and tile_prefix stages, PR + PT), so scatter writes without cross-CTA atomics, and the per-unit ranges concatenate to a globally q-ascending row that is byte-identical to the output of the host's sequential writer.
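The slot-reservation idea can be illustrated with a sequential Python model in which each "unit" stands in for one (CTA, warp) pair. This is a sketch under stated assumptions, not the device kernel: `build_csr_by_units`, the even split of the q-range, the flat row index, and the treatment of negative entries as padding are hypothetical, and a single head is modeled. The stage names in the comments refer to the stages listed for this module.

```python
def build_csr_by_units(q2k, num_rows, num_units):
    """Model of hist -> row_prefix -> tile_prefix -> scatter for one head."""
    total_q = len(q2k)
    chunk = -(-total_q // num_units)  # ceil division
    # Each unit owns one contiguous q-sub-range (possibly empty).
    ranges = [(min(u * chunk, total_q), min((u + 1) * chunk, total_q))
              for u in range(num_units)]

    # hist: per-unit histograms over rows.
    tile_counts = [[0] * num_rows for _ in range(num_units)]
    for u, (lo, hi) in enumerate(ranges):
        for q in range(lo, hi):
            for r in q2k[q]:
                if r >= 0:  # negative = padding (assumption)
                    tile_counts[u][r] += 1

    # row_prefix: total count per row -> row_ptr.
    row_ptr = [0] * (num_rows + 1)
    for r in range(num_rows):
        row_ptr[r + 1] = row_ptr[r] + sum(tc[r] for tc in tile_counts)

    # tile_prefix: exclusive scan along the unit axis -> per-unit base.
    base = [[0] * num_rows for _ in range(num_units)]
    for r in range(num_rows):
        running = row_ptr[r]
        for u in range(num_units):
            base[u][r] = running
            running += tile_counts[u][r]

    # scatter: every unit writes only into slots it reserved, so the
    # units could run in any order (or in parallel) without atomics.
    q_idx = [0] * row_ptr[-1]
    for u, (lo, hi) in enumerate(ranges):
        cursor = list(base[u])
        for q in range(lo, hi):
            for r in q2k[q]:
                if r >= 0:
                    q_idx[cursor[r]] = q
                    cursor[r] += 1
    return row_ptr, q_idx
```

Because unit u's slots for a row start where units 0..u-1 end, the result does not depend on `num_units`; running with one unit reduces to a plain sequential writer.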

SMEM histogram/cursor entries are one Int32 per (warp,row), with no int16 bit-packing: this removes the per-warp count cap at the cost of 2x the per-warp SMEM. kwarps is picked so that two CTAs still fit per SM at the BF16/non-paged row counts.

q_per_cta chunking: each non-empty row yields ceil(row_count/q_per_cta) work items. The default q_per_cta is 128, which equals the fwd CTA query cap (BM).
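The chunking rule can be checked with a few lines of Python. This is only an illustration of the ceil(row_count/q_per_cta) formula: the helper name `chunk_rows` and the (row, start, length) tuple layout are hypothetical and do not describe the actual layout of qsplit, split_counts, or scheduler_metadata.

```python
def chunk_rows(row_counts, q_per_cta=128):
    """Split each non-empty row into work items of at most q_per_cta queries.

    Returns a list of (row, start, length) tuples (hypothetical layout).
    A row with `count` queries yields ceil(count / q_per_cta) items.
    """
    items = []
    for row, count in enumerate(row_counts):
        # range() is empty when count == 0, so empty rows yield no items.
        for start in range(0, count, q_per_cta):
            items.append((row, start, min(q_per_cta, count - start)))
    return items
```

For example, a row with 300 queries and the default cap of 128 produces three work items of lengths 128, 128, and 44, while an empty row produces none.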

Structs

K2qCsrDeviceSizes: Host-computed sizing for the device CSR (allocated by the caller).

Functions

build_k2q_csr_device: Builds the reverse-CSR + schedule on the device into caller buffers.
k2q_csr_sizes: Returns the device-CSR sizing (matches the host builder's formulas).
