Mojo function
row_mean_of_squares_qk_gpu_block
def row_mean_of_squares_qk_gpu_block[in_dtype: DType, out_dtype: DType, out_mut: Bool, out_layout: TensorLayout, out_origin: Origin[mut=out_mut], q_layout: TensorLayout, q_origin: Origin[mut=q_origin.mut], k_layout: TensorLayout, k_origin: Origin[mut=k_origin.mut], //, simd_width: Int, max_warps_per_block: Int](output: TileTensor[out_dtype, out_layout, out_origin], q: TileTensor[in_dtype, q_layout, q_origin], k: TileTensor[in_dtype, k_layout, k_origin], q_cols: Int, k_cols: Int) where out_mut
Fused per-row mean of squares for Q and K in a single launch.
The grid is 2D: block_idx.x selects the row and block_idx.y selects the
operand (0 = Q, 1 = K). Each block owns one (row, operand) reduction and
writes its result to column block_idx.y of the [M, 2] output, where M is
the number of rows. A single launch therefore replaces two
row_mean_of_squares launches plus a concat. All operands (q [M, Nq],
k [M, Nk], and the [M, 2] output) are passed directly as TileTensors and
loaded/stored in-kernel.
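The output layout described above can be illustrated with a small host-side reference. This is a minimal sketch in plain Python, not Mojo and not the kernel itself. The helper names (mean_of_squares, row_mean_of_squares_qk_reference) are hypothetical, and it assumes the mean divides each row's sum of squares by that row's column count (Nq for Q, Nk for K). The two nested loops stand in for the 2D grid: the outer index plays the role of block_idx.x and the inner index plays the role of block_idx.y.

```python
# Host-side reference sketch (plain Python). Names are illustrative only;
# nothing here is part of the Mojo API.

def mean_of_squares(row):
    # Assumption: the mean is the sum of squares divided by the row length.
    return sum(x * x for x in row) / len(row)


def row_mean_of_squares_qk_reference(q, k):
    # q is an M x Nq list of lists, k is an M x Nk list of lists.
    rows = len(q)
    assert len(k) == rows, "q and k must have the same number of rows"

    out = [[0.0, 0.0] for _ in range(rows)]  # the [M, 2] output
    operands = (q, k)                        # operand 0 = Q, operand 1 = K

    # Emulate the 2D grid: one "block" per (row, operand) pair.
    for block_x in range(rows):      # stands in for block_idx.x (row)
        for block_y in range(2):     # stands in for block_idx.y (operand)
            row = operands[block_y][block_x]
            out[block_x][block_y] = mean_of_squares(row)
    return out


q = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]  # M = 2, Nq = 4
k = [[3.0, 4.0], [1.0, 1.0]]                      # M = 2, Nk = 2
result = row_mean_of_squares_qk_reference(q, k)
print(result)
```

Column 0 of each output row holds the Q result and column 1 holds the K result, which is the same shape that two separate per-operand reductions followed by a concat would produce.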