
Mojo function

fused_silu_mxfp8_interleaved_kernel

def fused_silu_mxfp8_interleaved_kernel[fp8_dtype: DType, scales_dtype: DType, input_dtype: DType, output_layout: TensorLayout, scales_layout: TensorLayout, input_layout: TensorLayout, offsets_layout: TensorLayout, scales_offsets_layout: TensorLayout, num_threads: Int, num_sms: Int, clamp_activation: Bool = False](output_tensor: TileTensor[fp8_dtype, output_layout, MutUntrackedOrigin], scales_tensor: TileTensor[scales_dtype, scales_layout, MutUntrackedOrigin], input_tensor: TileTensor[input_dtype, input_layout, ImmutUntrackedOrigin], row_offsets: TileTensor[DType.uint32, offsets_layout, ImmutUntrackedOrigin], scales_offsets: TileTensor[DType.uint32, scales_offsets_layout, ImmutUntrackedOrigin], alpha: Float32 = 0, limit: Float32 = 0)

SwiGLU + MXFP8 quantization for interleaved gate/up layout.

MXFP8 counterpart of fused_silu_nvfp4_interleaved_kernel. Consumes BF16 inputs in the [gate_0, up_0, gate_1, up_1, ...] interleaved layout produced by permuting the MoE up-projection weight on the N axis with sigma(2i)=i, sigma(2i+1)=H+i, applies SiLU(gate)*up, and quantizes the result per 32-element block to FP8-E4M3 with FP8-UE8M0 block scales.
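The layout and activation described above can be illustrated with a small CPU reference. This is a minimal sketch in plain Python, not the kernel's code: the helper names `interleave_columns` and `swiglu_interleaved` are hypothetical, floats stand in for BF16, and the optional `clamp_activation`, `alpha`, and `limit` parameters are not modeled.

```python
import math


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def interleave_columns(hidden_size: int) -> list[int]:
    """Column order implied by sigma(2i) = i, sigma(2i+1) = H + i.

    Position 2i of the permuted N axis reads original column i (gate),
    position 2i+1 reads original column H + i (up).
    """
    perm = []
    for i in range(hidden_size):
        perm.append(i)                # gate_i
        perm.append(hidden_size + i)  # up_i
    return perm


def swiglu_interleaved(row: list[float]) -> list[float]:
    """Apply SiLU(gate) * up to one row laid out as [gate_0, up_0, gate_1, up_1, ...]."""
    assert len(row) % 2 == 0, "interleaved row must hold gate/up pairs"
    return [silu(row[2 * i]) * row[2 * i + 1] for i in range(len(row) // 2)]
```

A row of 2H interleaved inputs yields H activations, which is why the quantized output has one value per hidden element.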

Compared with the NVFP4 variant, the differences are:

Output is one fp8_e4m3fn byte per element (vs. 2 fp4 nibbles per byte for NVFP4), so output_dim == hidden_size.
Block size is MXFP8_SF_VECTOR_SIZE = 32 (vs. 16) and the per-block scale uses 4 cooperating threads via lane_group_max (vs. 2 for NVFP4).
Scale dtype is float8_e8m0fnu (E8M0, power-of-2 only) and the scale value is block_max / 448 (FP8-E4M3 max abs), rounded to the nearest power of two by the cast to E8M0.
No per-expert input_scales (tensor_sf) value is folded into the scale: E8M0 cannot represent non-power-of-2 multipliers without precision loss, so MXFP8 activations use the per-block scale only. The caller's c_input_scales is therefore unused here and not plumbed through; the fused in-tile path follows the same contract.
The 5D scale tile layout is identical to NVFP4's; only the SF_VECTOR_SIZE divisor passed to set_scale_factor changes.
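The per-block scale rule from the list above (block_max / 448, stored as a power of two) can be modeled on the CPU as follows. This is a hedged sketch, not the kernel's implementation: `e8m0_exponent` and `block_scales` are hypothetical names, the rounding is assumed to be round-to-nearest in the log2 domain (the exact behavior of the real cast to E8M0 may differ), and the byte encoding assumes the usual E8M0 convention of a bias-127 exponent with 0xFF reserved for NaN. The 5D scale tile layout and the cross-thread lane_group_max reduction are not modeled.

```python
import math

MXFP8_SF_VECTOR_SIZE = 32  # elements per scale block (from the text above)
FP8_E4M3_MAX = 448.0       # largest finite magnitude in FP8-E4M3
E8M0_BIAS = 127            # assumed E8M0 exponent bias; byte 0xFF is NaN


def e8m0_exponent(block_max: float) -> int:
    """Pick the power-of-two exponent that approximates block_max / 448.

    Assumption: nearest power of two measured in log2 space.
    """
    raw = block_max / FP8_E4M3_MAX
    if raw <= 0.0:
        return -E8M0_BIAS  # all-zero block: smallest scale, 2**-127
    exp = math.floor(math.log2(raw) + 0.5)
    # Clamp to the exponent range that E8M0 can encode.
    return max(-E8M0_BIAS, min(E8M0_BIAS, exp))


def block_scales(row: list[float]) -> list[int]:
    """Return one biased E8M0 byte per 32-element block of `row`."""
    assert len(row) % MXFP8_SF_VECTOR_SIZE == 0, "row must be a whole number of blocks"
    scales = []
    for start in range(0, len(row), MXFP8_SF_VECTOR_SIZE):
        block = row[start:start + MXFP8_SF_VECTOR_SIZE]
        block_max = max(abs(v) for v in block)
        scales.append(e8m0_exponent(block_max) + E8M0_BIAS)
    return scales
```

Under these assumptions, each element would then be divided by 2 ** (byte - 127) before the cast to FP8-E4M3. Because the stored scale is a pure power of two, there is no room to fold in an extra per-expert multiplier without changing the exponent, which matches the note about input_scales above.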