Mojo function

row_mean_of_squares_qk_gpu_block

def row_mean_of_squares_qk_gpu_block[in_dtype: DType, out_dtype: DType, out_mut: Bool, out_layout: TensorLayout, out_origin: Origin[mut=out_mut], q_layout: TensorLayout, q_origin: Origin[mut=q_origin.mut], k_layout: TensorLayout, k_origin: Origin[mut=k_origin.mut], //, simd_width: Int, max_warps_per_block: Int](output: TileTensor[out_dtype, out_layout, out_origin], q: TileTensor[in_dtype, q_layout, q_origin], k: TileTensor[in_dtype, k_layout, k_origin], q_cols: Int, k_cols: Int) where out_mut

Fused per-row mean of squares for Q and K in a single launch.

The grid is 2D: block_idx.x selects the row and block_idx.y selects the operand (0 = Q, 1 = K). Each block owns one (row, operand) reduction and writes its result to column block_idx.y of the [M, 2] output, where M is the number of rows. A single launch therefore replaces two row_mean_of_squares launches followed by a concat. All operands (q [M, Nq], k [M, Nk], and the [M, 2] output) are passed directly as TileTensors and are loaded and stored in-kernel.
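The block-to-output mapping described above can be sketched as a CPU reference. This is plain Python rather than Mojo, so that no TileTensor or GPU calls have to be guessed at. The function name `row_mean_of_squares_qk_reference` is hypothetical, and the sketch assumes each mean divides the sum of squares by that operand's column count (Nq for Q, Nk for K); it is not the kernel's actual implementation.

```python
def row_mean_of_squares_qk_reference(q, k):
    """Hypothetical CPU reference, not the Mojo kernel.

    q: list of M rows, each with Nq floats.
    k: list of M rows, each with Nk floats.
    Returns an [M, 2] list: column 0 holds Q's mean of squares,
    column 1 holds K's mean of squares.
    """
    rows = len(q)
    if len(k) != rows:
        raise ValueError("q and k must have the same number of rows")

    operands = (q, k)  # index 0 = Q, index 1 = K
    out = [[0.0, 0.0] for _ in range(rows)]

    # The two loops stand in for the 2D grid: one iteration per block.
    for block_x in range(rows):      # plays the role of block_idx.x (row)
        for block_y in range(2):     # plays the role of block_idx.y (operand)
            row = operands[block_y][block_x]
            # Assumption: mean = sum of squares / number of columns.
            out[block_x][block_y] = sum(v * v for v in row) / len(row)
    return out


q = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]  # M = 2, Nq = 4
k = [[3.0, 4.0], [1.0, 1.0]]                      # M = 2, Nk = 2
result = row_mean_of_squares_qk_reference(q, k)
print(result)  # [[7.5, 12.5], [2.0, 1.0]]
```

Each (block_x, block_y) pair writes exactly one output element, so no two blocks touch the same location. That independence is what lets the Q and K reductions share one launch.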