
Mojo function

row_mean_of_squares_qk_gpu_block

def row_mean_of_squares_qk_gpu_block[in_dtype: DType, out_dtype: DType, out_mut: Bool, out_layout: TensorLayout, out_origin: Origin[mut=out_mut], q_layout: TensorLayout, q_origin: Origin[mut=q_origin.mut], k_layout: TensorLayout, k_origin: Origin[mut=k_origin.mut], //, simd_width: Int, max_warps_per_block: Int](output: TileTensor[out_dtype, out_layout, out_origin], q: TileTensor[in_dtype, q_layout, q_origin], k: TileTensor[in_dtype, k_layout, k_origin], q_cols: Int, k_cols: Int) where out_mut

Fused per-row mean of squares for Q and K in a single launch.

The grid is 2D: block_idx.x selects the row and block_idx.y selects the operand (0 = Q, 1 = K). Each block owns one (row, operand) reduction and writes its result to row block_idx.x, column block_idx.y of the [M, 2] output. A single launch therefore replaces two row_mean_of_squares launches plus a concat. All operands (q [M, Nq], k [M, Nk], and the [M, 2] output) are passed directly as TileTensors and are loaded and stored in-kernel.
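
The result can be written out explicitly. The following is a sketch of the intended math, not a statement of the kernel's implementation. It assumes that q_cols = Nq and k_cols = Nk, and that "mean of squares" is the plain sum of squared elements divided by the column count, with no epsilon and no square root. The accumulation precision and the cast to out_dtype are not specified above and are left out.

```latex
% Assumption: i indexes rows (0 <= i < M), j indexes columns.
% Assumption: N_q = q_cols and N_k = k_cols.
\begin{aligned}
\mathrm{output}[i, 0] &= \frac{1}{N_q} \sum_{j=0}^{N_q - 1} q[i, j]^2 \\
\mathrm{output}[i, 1] &= \frac{1}{N_k} \sum_{j=0}^{N_k - 1} k[i, j]^2
\end{aligned}
```

Under this reading, the block with block_idx = (i, 0) computes the first line and the block with block_idx = (i, 1) computes the second. Because each block divides by its own operand's column count, Nq and Nk need not be equal.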