IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Mojo function

fused_silu_mxfp8_interleaved_kernel

def fused_silu_mxfp8_interleaved_kernel[fp8_dtype: DType, scales_dtype: DType, input_dtype: DType, output_layout: TensorLayout, scales_layout: TensorLayout, input_layout: TensorLayout, offsets_layout: TensorLayout, scales_offsets_layout: TensorLayout, num_threads: Int, num_sms: Int, clamp_activation: Bool = False](output_tensor: TileTensor[fp8_dtype, output_layout, MutUntrackedOrigin], scales_tensor: TileTensor[scales_dtype, scales_layout, MutUntrackedOrigin], input_tensor: TileTensor[input_dtype, input_layout, ImmutUntrackedOrigin], row_offsets: TileTensor[DType.uint32, offsets_layout, ImmutUntrackedOrigin], scales_offsets: TileTensor[DType.uint32, scales_offsets_layout, ImmutUntrackedOrigin], alpha: Float32 = 0, limit: Float32 = 0)

SwiGLU + MXFP8 quantization for interleaved gate/up layout.

MXFP8 counterpart of fused_silu_nvfp4_interleaved_kernel. Consumes BF16 inputs in the [gate_0, up_0, gate_1, up_1, ...] interleaved layout produced by permuting the MoE up-projection weight on the N axis with sigma(2i)=i, sigma(2i+1)=H+i, applies SiLU(gate)*up, and quantizes the result per 32-element block to FP8-E4M3 with FP8-UE8M0 block scales.
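The interleaved layout and the activation can be illustrated with a small host-side reference model. This is a plain-Python sketch, not the Mojo kernel: the helper names (`interleave_permutation`, `swiglu_interleaved`) are hypothetical, and it assumes that sigma maps each interleaved column index to its source column in a weight whose N axis is laid out as H gate columns followed by H up columns.

```python
import math


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def interleave_permutation(hidden_size: int) -> list[int]:
    """Build sigma with sigma(2i) = i and sigma(2i+1) = H + i.

    Assumption: entry j is the source column (in a [gates | ups] layout)
    that lands at interleaved column j.
    """
    perm = []
    for i in range(hidden_size):
        perm.append(i)                # gate_i comes from column i
        perm.append(hidden_size + i)  # up_i comes from column H + i
    return perm


def swiglu_interleaved(row: list[float]) -> list[float]:
    """Apply SiLU(gate_i) * up_i to a row laid out as [gate_0, up_0, gate_1, up_1, ...]."""
    assert len(row) % 2 == 0, "interleaved row must hold gate/up pairs"
    return [silu(row[2 * i]) * row[2 * i + 1] for i in range(len(row) // 2)]


# Usage: permute a [gates | ups] row into the interleaved layout, then activate.
concat = [0.0, 1.0, 5.0, 2.0]                # gates = [0, 1], ups = [5, 2]
perm = interleave_permutation(2)             # [0, 2, 1, 3]
interleaved = [concat[p] for p in perm]      # [0.0, 5.0, 1.0, 2.0]
activated = swiglu_interleaved(interleaved)  # one value per gate/up pair
```

The activated row has half as many elements as the interleaved input, which is the row that the quantization step then splits into 32-element blocks.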

Compared with the NVFP4 variant the differences are:

  • Output is one fp8_e4m3fn byte per element (vs. 2 fp4 nibbles per byte for NVFP4), so output_dim == hidden_size.
  • Block size is MXFP8_SF_VECTOR_SIZE = 32 (vs. 16) and the per-block scale uses 4 cooperating threads via lane_group_max (vs. 2 for NVFP4).
  • Scale dtype is float8_e8m0fnu (E8M0, power-of-2 only) and the scale value is block_max / 448 (FP8-E4M3 max abs), rounded to the nearest power of two by the cast to E8M0.
  • No per-expert input_scales (tensor_sf) is folded into the scale: E8M0 cannot represent non-power-of-2 multipliers without precision loss, so MXFP8 activations use the per-block scale only. The caller's c_input_scales is therefore unused here and not plumbed through; the fused in-tile path follows the same contract.
  • 5D scale tile layout is identical to NVFP4; only the SF_VECTOR_SIZE divisor passed to set_scale_factor changes.
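The per-block scaling described in the list above can be sketched as a scalar reference model. This is a plain-Python illustration under stated assumptions, not the kernel: the helper names (`e8m0_scale`, `quantize_row`) are hypothetical, "nearest power of two" is modeled by rounding the base-2 exponent (the real E8M0 cast may round differently), the all-zero-block behavior is a guess, and the final rounding of each element to an FP8-E4M3 bit pattern is omitted, so values are only scaled and clamped to ±448.

```python
import math

MXFP8_SF_VECTOR_SIZE = 32  # elements per scale block (from the text above)
FP8_E4M3_MAX = 448.0       # largest finite magnitude in FP8-E4M3


def e8m0_scale(block_max: float) -> float:
    """Return a power-of-two scale close to block_max / 448.

    Assumption: the exponent is rounded to the nearest integer and clamped
    to the E8M0 range [-127, 127]; the actual cast may behave differently.
    """
    if block_max == 0.0:
        return 2.0 ** -127  # assumption: smallest E8M0 scale for an all-zero block
    exponent = round(math.log2(block_max / FP8_E4M3_MAX))
    exponent = max(-127, min(127, exponent))
    return 2.0 ** exponent


def quantize_row(values: list[float]) -> tuple[list[float], list[float]]:
    """Split a row into 32-element blocks; return (scales, scaled_values)."""
    assert len(values) % MXFP8_SF_VECTOR_SIZE == 0, "row must be a multiple of 32"
    scales: list[float] = []
    scaled: list[float] = []
    for start in range(0, len(values), MXFP8_SF_VECTOR_SIZE):
        block = values[start:start + MXFP8_SF_VECTOR_SIZE]
        block_max = max(abs(v) for v in block)
        scale = e8m0_scale(block_max)
        scales.append(scale)
        # Rounding the exponent to nearest can leave |v / scale| above 448,
        # so clamp to the FP8-E4M3 range. Mantissa rounding is not modeled.
        scaled.extend(
            max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in block
        )
    return scales, scaled


# Usage: two blocks whose maxima are 3.5 and 896.
row = [3.5] + [0.0] * 31 + [896.0] * 32
scales, scaled = quantize_row(row)
```

In this model there is one scale per 32 elements and one output value per input element, matching output_dim == hidden_size. If the 4 cooperating threads split a block evenly, each covers 8 elements, the same per-thread width as 2 threads over a 16-element NVFP4 block.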