
Mojo function

fused_silu_nvfp4_kernel

def fused_silu_nvfp4_kernel[fp4_dtype: DType, scales_dtype: DType, input_dtype: DType, output_layout: TensorLayout, scales_layout: TensorLayout, input_layout: TensorLayout, offsets_layout: TensorLayout, scales_offsets_layout: TensorLayout, input_scales_layout: TensorLayout, num_threads: Int, num_sms: Int](output_tensor: TileTensor[fp4_dtype, output_layout, MutUntrackedOrigin], scales_tensor: TileTensor[scales_dtype, scales_layout, MutUntrackedOrigin], input_tensor: TileTensor[input_dtype, input_layout, ImmUntrackedOrigin], row_offsets: TileTensor[DType.uint32, offsets_layout, ImmUntrackedOrigin], scales_offsets: TileTensor[DType.uint32, scales_offsets_layout, ImmUntrackedOrigin], input_scales: TileTensor[DType.float32, input_scales_layout, ImmUntrackedOrigin])

This kernel performs the SILU operation for all the MLPs in the EP MoE module. The kernel is implemented manually because, after the EP dispatch phase, the host does not know the actual number of received tokens. Instead, the kernel reads the row offsets to determine the actual number of received tokens in the input tensor.
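To illustrate how row offsets can encode the received token count, here is a minimal host-side Python sketch. It assumes that `row_offsets` is a prefix sum with a leading zero and one boundary per local expert; this page does not state that layout, so treat it as an assumption. The helper names `received_token_count` and `expert_of_row` are hypothetical and are not part of the Mojo API.

```python
from bisect import bisect_right


def received_token_count(row_offsets):
    """Total rows actually received.

    Assumption: row_offsets is a prefix sum, so its last entry is the
    end of the final expert's row range.
    """
    return row_offsets[-1]


def expert_of_row(row_offsets, row):
    """Index of the expert whose [start, end) row range contains `row`."""
    if not 0 <= row < row_offsets[-1]:
        raise IndexError("row is beyond the received tokens")
    # bisect_right skips past empty experts that share the same offset.
    return bisect_right(row_offsets, row) - 1


# Hypothetical offsets for 3 local experts that received 3, 0, and 4 tokens.
row_offsets = [0, 3, 3, 7]
total_rows = received_token_count(row_offsets)
```

Under this assumption, rows at or beyond `total_rows` hold no valid tokens, so a kernel launched with a fixed grid can skip them without the host ever knowing the count.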

After the SILU operation is performed, the result is quantized to the NVFP4 format and written to the output tensor. The scales tensor is padded and zero-filled.
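As a rough reference for the math involved, the following Python sketch applies SILU and then quantizes the result in blocks. It relies on the general description of NVFP4 as 4-bit E2M1 values that share one scale per block of 16 elements. It is a simplified sketch, not the kernel's implementation: the scale is kept as a plain float rather than being encoded in the scales dtype, ties are rounded toward the smaller magnitude, and the per-expert `input_scales` and the padded scales layout are left out because this page does not specify how they are applied. All function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value (the sign is a separate bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements that share one scale factor


def silu(x):
    """SiLU(x) = x * sigmoid(x), written to avoid overflow in exp."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def quantize_block(values):
    """Return (scale, quantized) for one block of floats.

    The scale maps the block's largest magnitude onto 6.0, the largest
    E2M1 magnitude. An all-zero block gets a zero scale.
    """
    amax = max(abs(v) for v in values)
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 0.0
    quantized = []
    for v in values:
        target = abs(v) / scale if scale > 0 else 0.0
        # Nearest representable magnitude; min() keeps the smaller one on ties.
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - target))
        quantized.append(math.copysign(nearest, v))
    return scale, quantized


def silu_nvfp4_row(row):
    """Apply SiLU to a row, then quantize it block by block."""
    activated = [silu(x) for x in row]
    return [
        quantize_block(activated[i:i + BLOCK])
        for i in range(0, len(activated), BLOCK)
    ]
```

Each element is recovered approximately as `scale * q`, so the quantization error within a block grows with that block's largest magnitude.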

Arguments:

- output_tensor: The output tensor to store the result.
- scales_tensor: The tensor to store the scales.
- input_tensor: The input tensor to apply the SILU operation to.
- row_offsets: The row offsets used to determine the actual number of received tokens.
- scales_offsets: The offsets used to determine the position of the scales tiles.
- input_scales: Per-expert input scale factors.