Mojo function
rms_norm_fused_residual_add_gpu_block_no_shmem
rms_norm_fused_residual_add_gpu_block_no_shmem[mut1: Bool, LayoutType1: TensorLayout, origin1: Origin[mut=mut1], mut2: Bool, LayoutType2: TensorLayout, origin2: Origin[mut=mut2], dtype: DType, //, simd_width: Int, max_warps_per_block: Int, input_fn: def[width: Int](row: Int, col: Int) capturing -> SIMD[dtype, width], residual_input_fn: def[width: Int](row: Int, col: Int) capturing -> SIMD[dtype, width], output_fn: def[width: Int, alignment: Int](row: Int, col: Int, val: SIMD[dtype, width]) capturing -> None, output_residual_fn: def[width: Int, alignment: Int](row: Int, col: Int, val: SIMD[dtype, width]) capturing -> None, multiply_before_cast: Bool](gamma1: TileTensor[dtype, LayoutType1, origin1], epsilon1: Scalar[dtype], weight_offset1: Scalar[dtype], gamma2: TileTensor[dtype, LayoutType2, origin2], epsilon2: Scalar[dtype], weight_offset2: Scalar[dtype], num_rows: Int, num_cols: Int)
RMS norm fused with residual add, without shared memory reductions.
Each warp independently processes one row using only warp-level
reductions (warp.sum), so the kernel uses no shared memory at all.
Multiple rows are processed per block, one row per warp. Intermediate
results between stages are recomputed from global memory instead of
being staged in shared memory, trading extra global memory reads for
zero shared memory usage. This is particularly useful on Apple GPUs,
where shared memory capacity is limited.
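The signature above is generic over the element type and over the input/output closures, but the underlying access pattern is simple. As a rough illustration only, here is a minimal CUDA sketch of the same technique: one warp per row, a shuffle-based warp reduction, and the fused residual add recomputed in a second pass rather than cached in shared memory. The kernel name, parameter layout, and float-only math are assumptions for this sketch, not the Mojo implementation.

```cuda
// Hedged CUDA sketch of the technique described above. Names such as
// rms_norm_residual_warp are hypothetical; this is not the Mojo kernel.
#include <cuda_runtime.h>
#include <math.h>

__device__ float warp_sum(float v) {
    // Reduce across the 32 lanes of a warp using shuffles only.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    // Broadcast lane 0's total back to every lane.
    return __shfl_sync(0xffffffffu, v, 0);
}

// Fused: residual_out = x + residual; out = rms_norm(x + residual) * gamma.
// One warp handles one row; launch with blockDim.x = 32 * warps_per_block.
__global__ void rms_norm_residual_warp(const float* __restrict__ x,
                                       const float* __restrict__ residual,
                                       const float* __restrict__ gamma,
                                       float* __restrict__ out,
                                       float* __restrict__ residual_out,
                                       int num_rows, int num_cols,
                                       float epsilon) {
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    int warps_per_block = blockDim.x / 32;
    int row = blockIdx.x * warps_per_block + warp;
    if (row >= num_rows) return;  // whole warp exits together

    const float* xr = x + (long long)row * num_cols;
    const float* rr = residual + (long long)row * num_cols;

    // Pass 1: accumulate the sum of squares of (x + residual) and
    // write the fused residual output.
    float sq = 0.0f;
    for (int col = lane; col < num_cols; col += 32) {
        float v = xr[col] + rr[col];
        sq += v * v;
        residual_out[(long long)row * num_cols + col] = v;
    }
    float inv_rms = rsqrtf(warp_sum(sq) / num_cols + epsilon);

    // Pass 2: recompute x + residual instead of caching it in shared
    // memory, trading an extra global read for zero shared memory use.
    for (int col = lane; col < num_cols; col += 32) {
        float v = xr[col] + rr[col];
        out[(long long)row * num_cols + col] = v * inv_rms * gamma[col];
    }
}
```

Keeping blockDim.x a multiple of 32 keeps each warp's lanes contiguous, so the full-warp shuffle mask covers exactly the threads cooperating on a row and no shared memory or block-wide synchronization is needed.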