IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /max/get-started.md). For the complete documentation index, see llms.txt.

Mojo function

fused_silu_nvfp4_interleaved_kernel

fused_silu_nvfp4_interleaved_kernel[fp4_dtype: DType, scales_dtype: DType, input_dtype: DType, output_layout: TensorLayout, scales_layout: TensorLayout, input_layout: TensorLayout, offsets_layout: TensorLayout, scales_offsets_layout: TensorLayout, input_scales_layout: TensorLayout, num_threads: Int, num_sms: Int](output_tensor: TileTensor[fp4_dtype, output_layout, MutExternalOrigin], scales_tensor: TileTensor[scales_dtype, scales_layout, MutExternalOrigin], input_tensor: TileTensor[input_dtype, input_layout, ImmutExternalOrigin], row_offsets: TileTensor[DType.uint32, offsets_layout, ImmutExternalOrigin], scales_offsets: TileTensor[DType.uint32, scales_offsets_layout, ImmutExternalOrigin], input_scales: TileTensor[DType.float32, input_scales_layout, ImmutExternalOrigin])

SwiGLU + NVFP4 quantization for interleaved gate/up layout.

A variant of fused_silu_nvfp4_kernel that consumes inputs in the interleaved layout [gate_0, up_0, gate_1, up_1, ...]. This layout results from permuting the MoE up-projection weight along the N axis with σ(2i)=i, σ(2i+1)=H+i. It is used by the fallback path of grouped_matmul_swiglu_nvfp4_dispatch for tile sizes that cannot fuse SwiGLU and quantization into the matmul epilogue (BN < 32).
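The permutation can be illustrated with a small host-side model in plain Python. This is only a sketch, not kernel code. It assumes the unpermuted N axis is laid out as [gate | up], with gate in the first H columns, and that σ maps each output column to the source column it reads. The helper names interleave_permutation and permute_columns are hypothetical.

```python
def interleave_permutation(H):
    """Build sigma for 2*H columns: sigma(2i) = i, sigma(2i+1) = H + i.

    Assumption: sigma[j] is the source column read by output column j.
    """
    sigma = [0] * (2 * H)
    for i in range(H):
        sigma[2 * i] = i          # even output slot reads gate column i
        sigma[2 * i + 1] = H + i  # odd output slot reads up column i
    return sigma


def permute_columns(row, sigma):
    """Apply the permutation to one row along the N axis."""
    return [row[src] for src in sigma]


H = 4
# Assumed unpermuted layout: all gate columns, then all up columns.
row = [f"gate_{i}" for i in range(H)] + [f"up_{i}" for i in range(H)]
interleaved = permute_columns(row, interleave_permutation(H))
print(interleaved)
```

Under these assumptions the printed row alternates gate and up entries, starting with gate_0, which matches the layout this kernel expects.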

The only difference from fused_silu_nvfp4_kernel is the load pattern in the inner loop. Instead of loading gate from [k, k+8) and up from [k+H, k+H+8), this kernel loads one 16-wide chunk at [2k, 2k+16) and splits it with stride 2 into gate (even lanes) and up (odd lanes). All downstream steps (SwiGLU, two-thread-per-SF reduction, scale math, packed nibble store, trailing zero-pad to SF_MN_GROUP_SIZE) are identical.
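The two load patterns can be compared in a scalar Python reference model. This is a sketch under stated assumptions, not the kernel's implementation. It assumes SwiGLU is computed as silu(gate) * up, and the function names swiglu_concat and swiglu_interleaved are hypothetical. Quantization, the scale-factor reduction, and the packed store are deliberately left out.

```python
import math


def silu(x):
    # SiLU activation: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def swiglu_concat(row, H, k, width=8):
    """Non-interleaved pattern: gate at [k, k+8), up at [k+H, k+H+8)."""
    gate = row[k : k + width]
    up = row[k + H : k + H + width]
    return [silu(g) * u for g, u in zip(gate, up)]


def swiglu_interleaved(row, k, width=8):
    """Interleaved pattern: one 16-wide chunk at [2k, 2k+16), split with stride 2."""
    chunk = row[2 * k : 2 * k + 2 * width]
    gate = chunk[0::2]  # even lanes
    up = chunk[1::2]    # odd lanes
    return [silu(g) * u for g, u in zip(gate, up)]


H = 16
gate_vals = [0.25 * i - 2.0 for i in range(H)]
up_vals = [1.0 + 0.5 * i for i in range(H)]

concat_row = gate_vals + up_vals  # [gate | up]
interleaved_row = [v for pair in zip(gate_vals, up_vals) for v in pair]

# Both patterns feed identical (gate, up) pairs into SwiGLU.
for k in (0, 8):
    assert swiglu_concat(concat_row, H, k) == swiglu_interleaved(interleaved_row, k)
```

Because both functions produce the same eight gate/up pairs for a given k, everything after the load can be shared between the two kernels, as the description states.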