Mojo function
fused_silu_mxfp8_interleaved_kernel
def fused_silu_mxfp8_interleaved_kernel[fp8_dtype: DType, scales_dtype: DType, input_dtype: DType, output_layout: TensorLayout, scales_layout: TensorLayout, input_layout: TensorLayout, offsets_layout: TensorLayout, scales_offsets_layout: TensorLayout, num_threads: Int, num_sms: Int, clamp_activation: Bool = False](output_tensor: TileTensor[fp8_dtype, output_layout, MutUntrackedOrigin], scales_tensor: TileTensor[scales_dtype, scales_layout, MutUntrackedOrigin], input_tensor: TileTensor[input_dtype, input_layout, ImmutUntrackedOrigin], row_offsets: TileTensor[DType.uint32, offsets_layout, ImmutUntrackedOrigin], scales_offsets: TileTensor[DType.uint32, scales_offsets_layout, ImmutUntrackedOrigin], alpha: Float32 = 0, limit: Float32 = 0)
SwiGLU + MXFP8 quantization for interleaved gate/up layout.
MXFP8 counterpart of fused_silu_nvfp4_interleaved_kernel. Consumes
BF16 inputs in the [gate_0, up_0, gate_1, up_1, ...] interleaved
layout produced by permuting the MoE up-projection weight on the N
axis with sigma(2i)=i, sigma(2i+1)=H+i, applies SiLU(gate)*up, and
quantizes the result per 32-element block to FP8-E4M3 with FP8-UE8M0
block scales.
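The interleaving and activation described above can be modeled on the host with a few lines of plain Python. This is only a reference sketch, not the kernel: it assumes the un-permuted projection output is the concatenation [gate_0..gate_{H-1}, up_0..up_{H-1}], it reads sigma as "new column j takes old column sigma(j)", and it ignores the alpha, limit, and clamp_activation parameters. The helper names are hypothetical.

```python
import math

def interleave_permutation(hidden_size):
    """Build sigma as a list: new column j reads old column sigma[j]."""
    sigma = []
    for i in range(hidden_size):
        sigma.append(i)                # sigma(2i)   = i      -> gate_i
        sigma.append(hidden_size + i)  # sigma(2i+1) = H + i  -> up_i
    return sigma

def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_interleaved(row):
    """row = [gate_0, up_0, gate_1, up_1, ...]; returns H activations."""
    return [silu(row[2 * i]) * row[2 * i + 1] for i in range(len(row) // 2)]

H = 4
# Assumed un-permuted layout: [gate_0..gate_3, up_0..up_3].
concat = [0.0, 1.0, -1.0, 2.0, 10.0, 20.0, 30.0, 40.0]
sigma = interleave_permutation(H)
interleaved = [concat[s] for s in sigma]     # what the kernel would consume
activated = swiglu_interleaved(interleaved)  # H values, one per gate/up pair
```

Because each gate value sits next to its up value, a thread that loads a contiguous run of the interleaved row already holds both operands of SiLU(gate)*up, which is the apparent motivation for permuting the weight ahead of time.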
Compared with the NVFP4 variant the differences are:
- Output is one fp8_e4m3fn byte per element (vs. 2 fp4 nibbles per byte for NVFP4), so output_dim == hidden_size.
- Block size is MXFP8_SF_VECTOR_SIZE = 32 (vs. 16) and the per-block scale uses 4 cooperating threads via lane_group_max (vs. 2 for NVFP4).
- Scale dtype is float8_e8m0fnu (E8M0, power-of-2 only) and the scale value is block_max / 448 (FP8-E4M3 max abs), rounded to the nearest power of two by the cast to E8M0.
- No per-expert input_scales (tensor_sf) folded into the scale: E8M0 cannot represent non-power-of-2 multipliers without precision loss; MXFP8 activations use the per-block scale only. Caller's c_input_scales is therefore unused here and not plumbed through; the fused in-tile path matches the same contract.
- 5D scale tile layout is identical to NVFP4; only the SF_VECTOR_SIZE divisor passed to set_scale_factor changes.
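The per-block scale arithmetic in the list above can be sketched as a host-side Python model. Several details here are assumptions rather than facts from this page: each of the 4 cooperating threads is taken to own 8 contiguous elements, "nearest power of two" is modeled as rounding log2 of the scale, the E8M0 byte is taken as exponent + 127, and the FP8 cast is modeled as round-to-nearest-even with saturation at 448. All function names are hypothetical, and the 5D scale tile layout is not modeled.

```python
import math

MXFP8_SF_VECTOR_SIZE = 32  # elements per scale block (from the text)
FP8_E4M3_MAX = 448.0       # FP8-E4M3 max abs (from the text)
LANES_PER_BLOCK = 4        # cooperating threads per block (from the text)

def lane_group_max_model(block):
    """Model of the group reduction: per-lane max, then max across lanes.
    Assumes each lane owns a contiguous slice of the block."""
    per_lane = len(block) // LANES_PER_BLOCK
    lane_max = [max(abs(v) for v in block[l * per_lane:(l + 1) * per_lane])
                for l in range(LANES_PER_BLOCK)]
    return max(lane_max)

def e8m0_exponent(scale):
    """Nearest power-of-two exponent, clamped to the assumed E8M0 range."""
    if scale <= 0.0:
        return -127
    return max(-127, min(127, round(math.log2(scale))))

def round_to_e4m3(x):
    """Round to the E4M3 grid (3 mantissa bits), saturating at +/-448."""
    if x == 0:
        return 0.0
    a = min(abs(x), FP8_E4M3_MAX)
    e = max(math.frexp(a)[1] - 1, -6)  # floor(log2(a)), floored at min normal
    step = 2.0 ** (e - 3)              # spacing of representable values
    return math.copysign(min(round(a / step) * step, FP8_E4M3_MAX), x)

def quantize_block(block):
    """Return (assumed E8M0 byte, list of FP8-representable values)."""
    assert len(block) == MXFP8_SF_VECTOR_SIZE
    block_max = lane_group_max_model(block)
    exp = e8m0_exponent(block_max / FP8_E4M3_MAX)
    scale = 2.0 ** exp
    return exp + 127, [round_to_e4m3(v / scale) for v in block]

# 31/448 is about 0.069; the nearest power of two in log space is 2**-4.
scale_byte, quantized = quantize_block([float(i) for i in range(32)])
```

Under this model, rounding to the nearest power of two can round the scale down, in which case the largest elements of the block exceed 448 after division and saturate. In the example, 31 / 2**-4 = 496 is clamped to 448. Whether the real cast saturates the same way is not stated on this page.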