Mojo function
fused_silu_nvfp4_interleaved_kernel
fused_silu_nvfp4_interleaved_kernel[
    fp4_dtype: DType,
    scales_dtype: DType,
    input_dtype: DType,
    output_layout: TensorLayout,
    scales_layout: TensorLayout,
    input_layout: TensorLayout,
    offsets_layout: TensorLayout,
    scales_offsets_layout: TensorLayout,
    input_scales_layout: TensorLayout,
    num_threads: Int,
    num_sms: Int
](
    output_tensor: TileTensor[fp4_dtype, output_layout, MutExternalOrigin],
    scales_tensor: TileTensor[scales_dtype, scales_layout, MutExternalOrigin],
    input_tensor: TileTensor[input_dtype, input_layout, ImmutExternalOrigin],
    row_offsets: TileTensor[DType.uint32, offsets_layout, ImmutExternalOrigin],
    scales_offsets: TileTensor[DType.uint32, scales_offsets_layout, ImmutExternalOrigin],
    input_scales: TileTensor[DType.float32, input_scales_layout, ImmutExternalOrigin]
)
SwiGLU + NVFP4 quantization for interleaved gate/up layout.
Variant of fused_silu_nvfp4_kernel that consumes inputs in the
[gate_0, up_0, gate_1, up_1, ...] interleaved layout. That layout is
produced by permuting the MoE up-projection weight on the N axis with
σ(2i)=i, σ(2i+1)=H+i, where H is the width of each of the gate and up
halves: permuted column 2i takes gate column i, and permuted column
2i+1 takes up column i. grouped_matmul_swiglu_nvfp4_dispatch uses this
kernel on its fallback path, for tile sizes that cannot fuse
SwiGLU+quant in the matmul epilogue (BN < 32).
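The permutation can be checked with a small reference model. The sketch below is plain Python, not the kernel's Mojo code. `sigma` and `interleave_columns` are hypothetical helper names, and the sketch assumes the unpermuted N axis stores all H gate columns followed by all H up columns.

```python
def sigma(j: int, h: int) -> int:
    """Source column for permuted column j: sigma(2i) = i, sigma(2i+1) = H + i."""
    i, is_up = divmod(j, 2)
    return h + i if is_up else i


def interleave_columns(row: list, h: int) -> list:
    """Apply sigma to one row laid out as [gate_0..gate_{H-1}, up_0..up_{H-1}]."""
    if len(row) != 2 * h:
        raise ValueError("row must hold H gate values followed by H up values")
    return [row[sigma(j, h)] for j in range(2 * h)]


H = 4  # illustrative width, not a value taken from the kernel
split_row = [f"gate_{i}" for i in range(H)] + [f"up_{i}" for i in range(H)]
interleaved_row = interleave_columns(split_row, H)
print(interleaved_row)
```

With H = 4 the printed row alternates gate and up entries, matching the [gate_0, up_0, gate_1, up_1, ...] layout described above.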
The only difference from fused_silu_nvfp4_kernel is the load pattern
in the inner loop. Instead of loading gate from [k, k+8) and up from
[k+H, k+H+8), this kernel loads one 16-wide chunk at [2k, 2k+16) and
splits it with stride 2 into gate (even lanes) and up (odd lanes). All
downstream steps (SwiGLU, two-thread-per-SF reduction, scale math,
packed nibble store, trailing zero-pad to SF_MN_GROUP_SIZE) are
identical.
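To make the load-pattern difference concrete, here is a scalar Python sketch of one inner-loop step under both layouts. It models only the loads and the SwiGLU product, not the NVFP4 quantization, and it is not the kernel's Mojo code. `silu`, `swiglu_split`, and `swiglu_interleaved` are hypothetical names, k is assumed to be a multiple of 8, and SwiGLU is assumed to be silu(gate) * up.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_split(row, k, h):
    """Non-interleaved loads: gate from [k, k+8), up from [k+H, k+H+8)."""
    gate = row[k:k + 8]
    up = row[k + h:k + h + 8]
    return [silu(g) * u for g, u in zip(gate, up)]


def swiglu_interleaved(row, k):
    """Interleaved loads: one 16-wide chunk at [2k, 2k+16), split with stride 2."""
    chunk = row[2 * k:2 * k + 16]
    gate = chunk[0::2]  # even lanes
    up = chunk[1::2]    # odd lanes
    return [silu(g) * u for g, u in zip(gate, up)]


H = 16  # illustrative width, not a value taken from the kernel
gate_vals = [0.25 * i - 2.0 for i in range(H)]
up_vals = [1.0 + 0.5 * i for i in range(H)]
split_row = gate_vals + up_vals
interleaved_row = [v for pair in zip(gate_vals, up_vals) for v in pair]

# Both load patterns feed the same eight (gate, up) pairs into SwiGLU.
for k in (0, 8):
    assert swiglu_split(split_row, k, H) == swiglu_interleaved(interleaved_row, k)
```

Because both loaders hand the same eight gate/up pairs to SwiGLU, everything after the load can be shared between the two kernels.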