Mojo function
row_mean_of_squares_qk_gpu
def row_mean_of_squares_qk_gpu[in_dtype: DType, out_dtype: DType, //, pdl_level: PDLLevel = PDLLevel.ON](output: TileTensor[out_dtype, Storage=output.Storage, address_space=output.address_space, linear_idx_type=output.linear_idx_type, element_size=output.element_size], q: TileTensor[in_dtype, Storage=q.Storage, address_space=q.address_space, linear_idx_type=q.linear_idx_type, element_size=q.element_size], k: TileTensor[in_dtype, Storage=k.Storage, address_space=k.address_space, linear_idx_type=k.linear_idx_type, element_size=k.element_size], rows: Int, q_cols: Int, k_cols: Int, ctx: DeviceContext)
Launches the fused Q/K mean-of-squares reduction as a single kernel launch with grid (rows, 2).
block_idx.y selects the operand: 0 for Q, 1 for K. The block dimension is sized for the wider of
the two operands; blocks that process the narrower operand leave their trailing threads idle.
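The launch shape described above can be illustrated with a small CPU reference in Python. This is only a sketch of the semantics, not the Mojo kernel or its API: the function name `row_mean_of_squares_qk_reference`, the nested-list inputs, the `out[operand][row]` output layout, the use of the first grid axis as the row index, and the one-thread-per-column block size are all assumptions made for illustration.

```python
def row_mean_of_squares_qk_reference(q, k, rows, q_cols, k_cols):
    """Emulate a (rows, 2) grid on the CPU: one 'block' per (row, operand) pair."""
    operands = (q, k)          # block_y == 0 selects Q, block_y == 1 selects K
    cols = (q_cols, k_cols)

    # Assumption: the block has one thread per column of the wider operand.
    block_dim = max(q_cols, k_cols)

    # Hypothetical output layout: out[0][row] for Q, out[1][row] for K.
    out = [[0.0] * rows for _ in range(2)]

    for block_y in range(2):           # plays the role of block_idx.y
        data = operands[block_y]
        n = cols[block_y]
        for block_x in range(rows):    # assumed to play the role of block_idx.x
            acc = 0.0
            for thread in range(block_dim):
                # Threads past the operand's width do no work; for the
                # narrower operand these are the idle trailing threads.
                if thread < n:
                    x = data[block_x][thread]
                    acc += x * x
            out[block_y][block_x] = acc / n   # mean of squares for this row
    return out


if __name__ == "__main__":
    q = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 2.0]]   # rows=2, q_cols=4
    k = [[3.0, 4.0], [1.0, 1.0]]                       # rows=2, k_cols=2
    print(row_mean_of_squares_qk_reference(q, k, rows=2, q_cols=4, k_cols=2))
```

In this sketch, K is half as wide as Q, so in every K block the last two of the four emulated threads stay idle, which mirrors the trade-off of sizing one block dimension for both operands in a single launch.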