
Mojo module

normalization

Functions

apply_qk_rms_norm: Fused per-element QK-RMSNorm apply for two operands Q and K.
apply_qk_rms_norm_cpu: Naive CPU reference path (also used as a correctness oracle).
apply_qk_rms_norm_gpu: Launches the fused Q/K RMSNorm apply: one launch, grid (rows, 2).
apply_qk_rms_norm_gpu_block: Fused per-element QK-RMSNorm apply for Q and K in a single launch.
block_reduce:
block_reduce_dual_sum: Combined block reduction for two sums using only 2 barriers.
group_norm:
group_norm_gpu:
group_norm_gpu_block:
group_norm_gpu_multi_block_norm: Multi-block normalize kernel: reduces partial stats and normalizes.
group_norm_gpu_multi_block_stats: Multi-block stats kernel: computes partial Welford statistics per split.
group_norm_gpu_warp_tiling:
group_norm_reshape: Reshapes an input buffer for group normalization by flattening all dimensions except the group dimension. Returns a 2D buffer of shape (num_groups * N, group_size), where group_size is the product of channels_per_group and spatial.
layer_norm:
layer_norm_cpu: Computes layernorm(elementwise_fn(x)) across the last dimension of x, where layernorm is defined as $(x - mean(x)) / sqrt(var(x) + eps) * gamma_fn + beta$.
layer_norm_gpu:
layer_norm_gpu_block:
layer_norm_gpu_warp_tiling:
layer_norm_reshape:
layer_norm_shape: Compute the output shape of a layer_norm operation.
rms_norm:
rms_norm_cpu:
rms_norm_fused_residual_add:
rms_norm_fused_residual_add_cpu:
rms_norm_fused_residual_add_gpu:
rms_norm_fused_residual_add_gpu_block:
rms_norm_fused_residual_add_gpu_block_no_shmem: RMS norm fused with residual add, without shared memory reductions.
rms_norm_fused_residual_add_gpu_warp_tiling:
rms_norm_gpu:
rms_norm_gpu_block:
rms_norm_gpu_warp_tiling:
rms_norm_gpu_warp_tiling_128:
rms_norm_rope_gpu: Fused RMS normalization followed by Rotary Position Embedding (RoPE) for GPU.
row_mean_of_squares: Per-row mean of squares over the last axis, accumulated in accum_type.
row_mean_of_squares_cpu: Naive CPU reference path (also used as a correctness oracle).
row_mean_of_squares_gpu: Launches the GPU mean-of-squares reduction: one block per row.
row_mean_of_squares_gpu_block:
row_mean_of_squares_qk: Fused per-row mean of squares for two operands Q and K.
row_mean_of_squares_qk_cpu: Naive CPU reference path (also used as a correctness oracle).
row_mean_of_squares_qk_gpu: Launches the fused Q/K mean-of-squares reduction: one launch, grid (rows, 2).
row_mean_of_squares_qk_gpu_block: Fused per-row mean of squares for Q and K in a single launch.
welford_block_all_reduce:
welford_combine:
welford_update:
welford_warp_reduce:
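
The entries above name a few numerical building blocks (Welford statistics that are updated per element and combined across splits, the layer norm formula, and the per-row mean of squares behind RMS norm) without showing the arithmetic. The following is a minimal reference sketch of that arithmetic in plain Python, not the Mojo API. The function names mirror the listing for readability only; every signature, the `splits` parameter, and the default `eps` are assumptions made for illustration, and the RMS norm shown is the standard textbook definition rather than a confirmed description of this module's kernels.

```python
import math

# Hypothetical reference helpers; signatures are illustrative, not the Mojo API.

def welford_update(count, mean, m2, x):
    """Fold one value into running (count, mean, sum of squared deviations)."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def welford_combine(a, b):
    """Merge two partial (count, mean, m2) triples into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

def row_stats(row, splits=2):
    """Mean and population variance of a row, computed split by split."""
    if not row:
        raise ValueError("row must be non-empty")
    chunk = math.ceil(len(row) / splits)
    total = (0, 0.0, 0.0)
    for start in range(0, len(row), chunk):
        part = (0, 0.0, 0.0)
        for x in row[start:start + chunk]:
            part = welford_update(*part, x)
        total = welford_combine(total, part)
    n, mean, m2 = total
    return mean, m2 / n

def layer_norm_ref(row, gamma, beta, eps=1e-5):
    """(x - mean(x)) / sqrt(var(x) + eps) * gamma + beta over one row."""
    mean, var = row_stats(row)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std * g + b for x, g, b in zip(row, gamma, beta)]

def row_mean_of_squares_ref(row):
    """Mean of squares over one row."""
    return sum(x * x for x in row) / len(row)

def rms_norm_ref(row, gamma, eps=1e-5):
    """Standard RMS norm: x / sqrt(mean(x^2) + eps) * gamma over one row."""
    inv_rms = 1.0 / math.sqrt(row_mean_of_squares_ref(row) + eps)
    return [x * inv_rms * g for x, g in zip(row, gamma)]
```

Combining partial statistics with `welford_combine` gives the same mean and variance as a single pass over the whole row, which is why a multi-block scheme can compute per-split statistics independently and reduce them afterwards.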
