Mojo function

fused_silu_nvfp4_interleaved_kernel

def fused_silu_nvfp4_interleaved_kernel[fp4_dtype: DType, scales_dtype: DType, input_dtype: DType, output_layout: TensorLayout, scales_layout: TensorLayout, input_layout: TensorLayout, offsets_layout: TensorLayout, scales_offsets_layout: TensorLayout, input_scales_layout: TensorLayout, num_threads: Int, num_sms: Int](output_tensor: TileTensor[fp4_dtype, output_layout, MutUntrackedOrigin], scales_tensor: TileTensor[scales_dtype, scales_layout, MutUntrackedOrigin], input_tensor: TileTensor[input_dtype, input_layout, ImmUntrackedOrigin], row_offsets: TileTensor[DType.uint32, offsets_layout, ImmUntrackedOrigin], scales_offsets: TileTensor[DType.uint32, scales_offsets_layout, ImmUntrackedOrigin], input_scales: TileTensor[DType.float32, input_scales_layout, ImmUntrackedOrigin])

SwiGLU + NVFP4 quantization for interleaved gate/up layout.

Variant of fused_silu_nvfp4_kernel that consumes inputs in the interleaved [gate_0, up_0, gate_1, up_1, ...] layout. That layout is produced by permuting the MoE up-projection weight along the N axis with σ(2i)=i, σ(2i+1)=H+i. The kernel is used by the fallback path of grouped_matmul_swiglu_nvfp4_dispatch, for tile sizes that cannot fuse SwiGLU+quant in the matmul epilogue (BN < 32).
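As an illustration of the permutation, here is a minimal host-side sketch in plain Python (not Mojo, and not the library's code). It assumes σ is read as "new column j is taken from original column σ(j)", which is the reading that yields the interleaved layout described above. The helper names interleave_permutation and permute_columns are hypothetical.

```python
def interleave_permutation(H):
    """Build sigma as a list with sigma[2*i] = i and sigma[2*i + 1] = H + i."""
    sigma = [0] * (2 * H)
    for i in range(H):
        sigma[2 * i] = i          # even slot: gate column i
        sigma[2 * i + 1] = H + i  # odd slot: up column i
    return sigma


def permute_columns(columns, sigma):
    """Assumed convention: new column j is taken from original column sigma[j]."""
    return [columns[src] for src in sigma]


# Original N-axis order: all gate columns first, then all up columns.
H = 3
split = ["gate_0", "gate_1", "gate_2", "up_0", "up_1", "up_2"]
interleaved = permute_columns(split, interleave_permutation(H))
print(interleaved)
# ['gate_0', 'up_0', 'gate_1', 'up_1', 'gate_2', 'up_2']
```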

The only difference from fused_silu_nvfp4_kernel is the load pattern in the inner loop. Instead of loading gate from [k, k+8) and up from [k+H, k+H+8), this kernel loads one 16-wide chunk at [2k, 2k+16) and splits it with stride 2 into gate (even lanes) and up (odd lanes). All downstream steps (SwiGLU, two-thread-per-SF reduction, scale math, packed nibble store, trailing zero-pad to SF_MN_GROUP_SIZE) are identical.
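The two load patterns can be compared with a small reference model in plain Python. This is a sketch of the indexing only, not the kernel's Mojo implementation: rows are Python lists, the helper names load_split and load_interleaved are hypothetical, and the chunk width of 8 comes from the ranges quoted above.

```python
CHUNK = 8  # gate/up elements per load, from the [k, k+8) ranges in the text


def load_split(row, k, H):
    """Non-interleaved pattern: gate at [k, k+8), up at [k+H, k+H+8)."""
    gate = row[k:k + CHUNK]
    up = row[k + H:k + H + CHUNK]
    return gate, up


def load_interleaved(row, k):
    """Interleaved pattern: one 16-wide chunk at [2k, 2k+16), split by stride 2."""
    chunk = row[2 * k:2 * k + 2 * CHUNK]
    gate = chunk[0::2]  # even lanes
    up = chunk[1::2]    # odd lanes
    return gate, up


# Build the same logical row in both layouts (H = 16 is an arbitrary example size).
H = 16
gate_vals = [float(i) for i in range(H)]
up_vals = [100.0 + i for i in range(H)]
split_row = gate_vals + up_vals
interleaved_row = [v for pair in zip(gate_vals, up_vals) for v in pair]

# Both patterns yield the same (gate, up) pair for every chunk start k.
for k in range(0, H, CHUNK):
    assert load_split(split_row, k, H) == load_interleaved(interleaved_row, k)
```

Because both loads return the same gate and up values, everything after the load can be shared between the two kernels, which matches the statement that the downstream steps are identical.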