
Mojo module

ep_comm

Expert-parallelism (EP) communication kernels for MoE token dispatch and combine.

Implements the token dispatch (scatter) and combine (gather) collectives used in Mixture-of-Experts inference across multiple GPUs, including FP8 and FP4 quantization paths for bandwidth-limited transfers.
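The dispatch (scatter) and combine (gather) steps described above can be illustrated with a small single-process reference in plain Python. This is only a sketch of the data movement, not this module's API: the function names `dispatch` and `combine`, the list-based buffers, and the weighted sum in the combine step are assumptions made for illustration. The real kernels run on GPUs and move data between ranks.

```python
def dispatch(tokens, topk_ids, num_experts):
    """Scatter each token to every expert listed in its top-k IDs.

    Returns per-expert input buffers and a matching src_info record
    (token index, top-k slot) for every row that was sent.
    """
    expert_inputs = [[] for _ in range(num_experts)]
    src_info = [[] for _ in range(num_experts)]
    for token_idx, experts in enumerate(topk_ids):
        for slot, expert_id in enumerate(experts):
            expert_inputs[expert_id].append(tokens[token_idx])
            src_info[expert_id].append((token_idx, slot))
    return expert_inputs, src_info


def combine(expert_outputs, src_info, router_weights, num_tokens, hidden_size):
    """Gather expert outputs back to their source tokens.

    Assumption: each returned row is scaled by the router weight for its
    (token, top-k slot) pair and accumulated into the output row.
    """
    out = [[0.0] * hidden_size for _ in range(num_tokens)]
    for expert_id, rows in enumerate(expert_outputs):
        for row, (token_idx, slot) in zip(rows, src_info[expert_id]):
            weight = router_weights[token_idx][slot]
            for h in range(hidden_size):
                out[token_idx][h] += weight * row[h]
    return out


# Two tokens, two experts, top-2 routing; experts act as the identity here.
tokens = [[1.0, 2.0], [3.0, 4.0]]
topk_ids = [[0, 1], [1, 0]]
expert_inputs, src_info = dispatch(tokens, topk_ids, num_experts=2)
combined = combine(expert_inputs, src_info, [[0.5, 0.5], [0.25, 0.75]], 2, 2)
```

Because the router weights for each token sum to 1 and the stand-in experts return their input unchanged, `combined` reproduces the original tokens, which makes the round trip easy to check.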

`comptime` values

`BLOCK_SCOPE`

comptime BLOCK_SCOPE = _BLOCK_SCOPE()

`DEVICE_SCOPE`

comptime DEVICE_SCOPE = _DEVICE_SCOPE()

`elementwise_epilogue_type`

comptime elementwise_epilogue_type = def[dtype: DType, width: SIMDLength, *, alignment: Int = Int(1)](IndexList[Int(2)], SIMD[dtype, width]) capturing thin -> None

`EP_DATA_READY_FLAG`

comptime EP_DATA_READY_FLAG = 1024

`input_scales_wrapper_type`

comptime input_scales_wrapper_type = def[dtype: DType](Int) capturing thin -> Scalar[dtype]

`MAX_GPUS_PER_NODE`

comptime MAX_GPUS_PER_NODE = 8

`router_weights_wrapper_type`

comptime router_weights_wrapper_type = def[width: Int](token_idx: Int, topk_id: Int) capturing thin -> SIMD[DType.float32, width]

Structs

BF16TokenFormat: Token format that transmits the full hidden state in BFloat16.
BlockwiseFP8TokenFormat: Token format that quantizes the hidden state to FP8 with block-wise scales.
EPCombineKernel: Implements combine_async and combine_wait kernel logic for Expert Parallelism.
EPDispatchKernel: Implements dispatch_async and dispatch_wait kernel logic for Expert Parallelism.
EPLocalSyncCounters: Manages atomic counters for EP kernel synchronization within a device.
MXFP4TokenFormat: Token format for MX (microscaling) FP4 quantization.
NVBlockScaledTokenFormat: Token format for NVIDIA block-scaled FP4/FP8 quantization.
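To see why the quantized token formats matter for bandwidth-limited transfers, the following sketch estimates the per-token payload size for a few element widths. It is a back-of-the-envelope calculation, not the layout used by these structs: the helper name `payload_bytes`, the hidden size of 7168, the block size of 128 with 4-byte scales for block-wise FP8, and the block size of 32 with 1-byte scales for MXFP4 are all assumptions chosen for illustration.

```python
def payload_bytes(hidden_size, bits_per_element, block_size=None, scale_bytes=0):
    """Estimate the bytes needed to send one token's hidden state.

    block_size and scale_bytes describe optional per-block scale factors;
    leave block_size as None for unquantized formats.
    """
    data = (hidden_size * bits_per_element + 7) // 8  # round up to whole bytes
    if block_size is None:
        return data
    num_blocks = (hidden_size + block_size - 1) // block_size  # ceiling division
    return data + num_blocks * scale_bytes


hidden_size = 7168  # hypothetical model width

bf16_bytes = payload_bytes(hidden_size, bits_per_element=16)
fp8_bytes = payload_bytes(hidden_size, 8, block_size=128, scale_bytes=4)
mxfp4_bytes = payload_bytes(hidden_size, 4, block_size=32, scale_bytes=1)

print(bf16_bytes, fp8_bytes, mxfp4_bytes)
```

Under these assumptions the FP8 payload is roughly half the BF16 payload and the FP4 payload roughly a quarter, with the scale factors adding only a small overhead.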

Traits

TokenFormat: Specifies the wire format for a single MoE token in EP dispatch/combine.

Functions

block_memcpy: Copies a memory region from source to destination using vectorized load and store instructions. The caller must ensure that both pointers are aligned to the SIMD width.
block_prefix_sum: Performs a prefix sum (scan) operation across all threads in a block.
combine_async_kernel: Sends tokens back to their original rank based on the src_info tensor. This kernel uses the non-blocking SHMEM API and returns immediately after initiating the communication. The communication is complete only after combine_wait_kernel has been called.
combine_kernel: Fused combine kernel that combines combine_async and combine_wait functionality in a single kernel launch.
combine_wait_kernel: Called after combine_async_kernel to complete the communication. It polls the receive count buffer; once a count is no longer MAX_FINITE, the communication from the corresponding remote rank is confirmed complete.
dispatch_async_kernel: Dispatches tokens to experts on remote ranks based on the top-k expert IDs. If use_shmem is True, this kernel uses the non-blocking SHMEM API and returns immediately after initiating the communication. The communication is complete only after dispatch_wait_kernel has been called.
dispatch_kernel: Fused dispatch kernel that combines dispatch_async and dispatch_wait functionality in a single kernel launch.
dispatch_wait_kernel: Called after dispatch_async_kernel to complete the communication. It polls the receive count buffer; once a count is no longer MAX_FINITE, the communication from the corresponding remote rank is confirmed complete.
ep_signal_completion: Signals the completion of the communication by writing to the receive count buffer. Uses direct memory access if the target device is on the same node, and the SHMEM API if it is on a different node.
fused_silu_fp8_kernel: Performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually here because, after the EP dispatch phase, the actual number of received tokens is not known to the host. It reads the row offsets to determine the actual number of received tokens in the input tensor.
fused_silu_kernel: Performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually here because, after the EP dispatch phase, the actual number of received tokens is not known to the host. It reads the row offsets to determine the actual number of received tokens in the input tensor.
fused_silu_mxfp4_kernel: Performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually here because, after the EP dispatch phase, the actual number of received tokens is not known to the host. It reads the row offsets to determine the actual number of received tokens in the input tensor.
fused_silu_mxfp8_interleaved_kernel: SwiGLU + MXFP8 quantization for interleaved gate/up layout.
fused_silu_nvfp4_interleaved_kernel: SwiGLU + NVFP4 quantization for interleaved gate/up layout.
fused_silu_nvfp4_kernel: Performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually here because, after the EP dispatch phase, the actual number of received tokens is not known to the host. It reads the row offsets to determine the actual number of received tokens in the input tensor.
get_device_alignment: Returns the natural SIMD alignment in bytes for the current GPU target.
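The fused SiLU kernels above read the row offsets to find out how many rows of the receive buffer actually hold tokens. A minimal host-side sketch of that idea follows. It is not the kernel's implementation: the name `fused_silu_reference`, the row layout (gate values in the first half of each row, up values in the second half), and building the offsets as a prefix sum of per-expert token counts are assumptions for illustration.

```python
import math
from itertools import accumulate


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def fused_silu_reference(buffer, row_offsets):
    """Apply silu(gate) * up to the valid rows of a fixed-capacity buffer.

    Assumption: row_offsets[-1] is the number of rows that were actually
    received; any rows after it are unused padding and are skipped.
    """
    num_valid = row_offsets[-1]
    out = []
    for row in buffer[:num_valid]:
        half = len(row) // 2
        gate, up = row[:half], row[half:]
        out.append([silu(g) * u for g, u in zip(gate, up)])
    return out


# Hypothetical per-expert token counts; offsets are their running total.
tokens_per_expert = [1, 1]
row_offsets = [0] + list(accumulate(tokens_per_expert))

# Buffer with capacity for three rows; only the first two are valid.
buffer = [[0.0, 2.0], [1.0, 3.0], [9.0, 9.0]]
result = fused_silu_reference(buffer, row_offsets)
```

Sizing the loop from the offsets rather than from the buffer capacity is what lets the launch proceed without the host knowing the received token count in advance.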
