Mojo function

row_mean_of_squares_qk_gpu

def row_mean_of_squares_qk_gpu[in_dtype: DType, out_dtype: DType, //, pdl_level: PDLLevel = PDLLevel.ON](output: TileTensor[out_dtype, Storage=output.Storage, address_space=output.address_space, linear_idx_type=output.linear_idx_type, element_size=output.element_size], q: TileTensor[in_dtype, Storage=q.Storage, address_space=q.address_space, linear_idx_type=q.linear_idx_type, element_size=q.element_size], k: TileTensor[in_dtype, Storage=k.Storage, address_space=k.address_space, linear_idx_type=k.linear_idx_type, element_size=k.element_size], rows: Int, q_cols: Int, k_cols: Int, ctx: DeviceContext)

Launches the fused Q/K mean-of-squares reduction as a single kernel launch with a grid of (rows, 2).

block_idx.y selects the operand: 0 for Q, 1 for K. The block dimension is sized for the wider of the two operands, so blocks that process the narrower operand leave their trailing threads idle.
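To make the launch geometry concrete, here is a minimal CPU reference sketch in plain Python. It is not the Mojo kernel. It only mimics the (rows, 2) grid and the idle-thread behavior described above. Several details are assumptions, not taken from this page: the function name `row_mean_of_squares_qk_reference`, the `[row][operand]` output layout, the use of `block_idx.x` as the row index, and the one-thread-per-column mapping (the real kernel may use vectorized loads and a different reduction order).

```python
def row_mean_of_squares_qk_reference(q, k):
    """Hypothetical CPU model of the fused Q/K mean-of-squares launch.

    q and k are lists of rows (lists of numbers) with the same row count
    and at least one column each. Returns out[row][y], where y == 0 holds
    the Q result and y == 1 holds the K result (assumed layout).
    """
    rows = len(q)
    if rows == 0 or len(k) != rows:
        raise ValueError("q and k must have the same non-zero row count")
    q_cols, k_cols = len(q[0]), len(k[0])
    if q_cols == 0 or k_cols == 0:
        raise ValueError("q and k must each have at least one column")

    # Block dim is sized for the wider operand (assumed: one thread per column).
    block_dim = max(q_cols, k_cols)
    out = [[0.0, 0.0] for _ in range(rows)]

    for block_x in range(rows):        # assumed: block_idx.x picks the row
        for block_y in range(2):       # block_idx.y picks Q (0) or K (1)
            operand, cols = (q, q_cols) if block_y == 0 else (k, k_cols)
            acc = 0.0
            for tid in range(block_dim):
                if tid >= cols:
                    continue           # trailing thread of the narrower operand: idle
                x = float(operand[block_x][tid])
                acc += x * x
            out[block_x][block_y] = acc / cols
    return out
```

With `q` of width 4 and `k` of width 2, every simulated block has 4 threads, and the blocks with `block_y == 1` skip the last two. Comparing a GPU result against a model like this is one way to sanity-check the per-row values, provided the output layout assumption matches the real kernel.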