
Mojo function

mha_single_batch

def mha_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, *, config: MHAConfig[config.dtype], group: Int = 1, sink: Bool = False](q_ptr: UnsafePointer[Scalar[q_type], ImmutAnyOrigin], k: k_t, v: v_t, output_ptr: UnsafePointer[Scalar[output_type], MutAnyOrigin], scale: Float32, seq_len: Int, max_seq_len: Int, start_pos: UInt32, num_keys: Int, mask_tensor_col: Int, mask: mask_t, batch_idx: Int, sink_weights: OptionalReg[LayoutTensor[q_type, Layout.row_major(-1), ImmutAnyOrigin]])

Multi-head attention (MHA) for token generation, where seq_len = 1 and num_keys >= 1.

The general data layout and steps follow flash attention, with two exceptions:

1. Partition across B, H, and num_keys (TODO). The last of these is a split-K partition and will need a separate reduction kernel at the end.

2. The first bmm becomes a gemv and the second bmm becomes a gevm. TODO: use more optimized kernels for them.
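The two exceptions can be illustrated with a small reference sketch. It is written in plain Python rather than Mojo so that it runs with the standard library alone, and it is not the kernel's implementation: the helper names (attend_one_query, merge_partials, single_query_attention) and the num_partitions parameter are hypothetical, and masking, sink weights, head grouping, and the batch and head dimensions are all left out. It assumes a single query vector, computes the scores as a matrix-vector product (gemv) and the output as a vector-matrix product (gevm), and shows one common way a split-K partition over num_keys could be reduced afterwards, by rescaling each partial result with its running maximum.

```python
import math


def attend_one_query(q, keys, values, scale):
    """Attention for one query vector over a (partial) list of keys.

    Returns (row_max, row_sum, unnormalized_out) so that results from
    different key partitions can be merged later.
    """
    # gemv: scores[j] = scale * dot(keys[j], q)
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    row_max = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - row_max) for s in scores]
    row_sum = sum(weights)
    # gevm: out[d] = sum_j weights[j] * values[j][d]
    depth = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(depth)]
    return row_max, row_sum, out


def merge_partials(partials):
    """Split-K style reduction over per-partition (max, sum, out) triples."""
    global_max = max(m for m, _, _ in partials)
    depth = len(partials[0][2])
    total_sum = 0.0
    total_out = [0.0] * depth
    for m, s, out in partials:
        # Rescale each partition from its local maximum to the global one.
        correction = math.exp(m - global_max)
        total_sum += correction * s
        for d in range(depth):
            total_out[d] += correction * out[d]
    return [x / total_sum for x in total_out]


def single_query_attention(q, keys, values, scale, num_partitions=1):
    """Hypothetical driver: split the keys into chunks, then reduce."""
    n = len(keys)
    chunk = -(-n // num_partitions)  # ceiling division
    partials = [
        attend_one_query(q, keys[i:i + chunk], values[i:i + chunk], scale)
        for i in range(0, n, chunk)
    ]
    return merge_partials(partials)


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.3, -0.2]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    scale = 1.0 / math.sqrt(len(q))
    print(single_query_attention(q, keys, values, scale))
    print(single_query_attention(q, keys, values, scale, num_partitions=3))
```

With num_partitions=1 the reduction is a plain normalization; with more partitions each chunk of keys is processed independently and the merge step plays the role of the separate reduction kernel mentioned in the first item. Both calls in the usage section should agree up to floating-point rounding.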