Mojo module

ep_comm

comptime values

BLOCK_SCOPE

comptime BLOCK_SCOPE = _BLOCK_SCOPE()

DEVICE_SCOPE

comptime DEVICE_SCOPE = _DEVICE_SCOPE()

elementwise_epilogue_type

comptime elementwise_epilogue_type = def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None

EP_DATA_READY_FLAG

comptime EP_DATA_READY_FLAG = 1024

input_scales_wrapper_type

comptime input_scales_wrapper_type = def[dtype: DType](Int) capturing -> Scalar[dtype]

MAX_GPUS_PER_NODE

comptime MAX_GPUS_PER_NODE = 8

router_weights_wrapper_type

comptime router_weights_wrapper_type = def[width: Int](token_idx: Int, topk_id: Int) capturing -> SIMD[DType.float32, width]

Structs

Traits

Functions

  • block_memcpy: Copies a memory area from source to destination using vectorized load and store instructions. The caller must ensure that both pointers are aligned to the SIMD width.
  • block_prefix_sum: Performs a prefix sum (scan) operation across all threads in a block.
  • combine_async_kernel: Sends tokens back to their original rank based on the src_info tensor. This kernel uses the non-blocking SHMEM API and returns immediately after initiating the communication. The communication is considered complete after combine_wait_kernel is called.
  • combine_kernel: Fused combine kernel that combines combine_async and combine_wait functionality in a single kernel launch.
  • combine_wait_kernel: This kernel is called after combine_async_kernel to complete the communication. It polls the receive count buffer; once a count is no longer MAX_FINITE, the communication from the corresponding remote rank is confirmed complete.
  • dispatch_async_kernel: Dispatches tokens to experts on remote ranks based on the top-k expert IDs. This kernel uses the non-blocking SHMEM API if use_shmem is True, and returns immediately after initiating the communication. The communication is considered complete after dispatch_wait_kernel is called.
  • dispatch_kernel: Fused dispatch kernel that combines dispatch_async and dispatch_wait functionality in a single kernel launch.
  • dispatch_wait_kernel: This kernel is called after dispatch_async_kernel to complete the communication. It polls the receive count buffer; once a count is no longer MAX_FINITE, the communication from the corresponding remote rank is confirmed complete.
  • ep_signal_completion: Signals the completion of the communication by writing to the receive count buffer. Uses direct memory access if the target device is on the same node, and the SHMEM API if the target device is on a different node.
  • fused_silu_fp8_kernel: This kernel performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually because, after the EP dispatch phase, the actual number of received tokens is not known to the host. The kernel reads the row offsets to determine the actual number of received tokens in the input tensor.
  • fused_silu_kernel: This kernel performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually because, after the EP dispatch phase, the actual number of received tokens is not known to the host. The kernel reads the row offsets to determine the actual number of received tokens in the input tensor.
  • fused_silu_mxfp4_kernel: This kernel performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually because, after the EP dispatch phase, the actual number of received tokens is not known to the host. The kernel reads the row offsets to determine the actual number of received tokens in the input tensor.
  • fused_silu_nvfp4_interleaved_kernel: SwiGLU + NVFP4 quantization for interleaved gate/up layout.
  • fused_silu_nvfp4_kernel: This kernel performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually because, after the EP dispatch phase, the actual number of received tokens is not known to the host. The kernel reads the row offsets to determine the actual number of received tokens in the input tensor.
  • get_device_alignment:
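
The prefix sum (scan) that block_prefix_sum is described as performing can be illustrated with a small CPU reference model. This is a Python sketch for intuition only, not the Mojo API: the function names are hypothetical, and this page does not state whether the kernel produces an inclusive or an exclusive scan, so both variants are shown.

```python
from itertools import accumulate


def inclusive_scan(values):
    """Element i of the result is the sum of values[0..i]."""
    return list(accumulate(values))


def exclusive_scan(values):
    """Element i of the result is the sum of values[0..i-1]; element 0 is 0."""
    if not values:
        return []
    running = list(accumulate(values))
    return [0] + running[:-1]


# One value per "thread" in a hypothetical block of five threads.
per_thread = [3, 1, 4, 1, 5]
print(inclusive_scan(per_thread))  # [3, 4, 8, 9, 14]
print(exclusive_scan(per_thread))  # [0, 3, 4, 8, 9]
```

An exclusive scan over per-thread counts is a common way to turn counts into write offsets, which is one plausible reason a scan appears in a token-routing module.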
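
The signal-and-poll handshake described for ep_signal_completion and the wait kernels can be modeled on the CPU. The sketch below is Python, not the Mojo or SHMEM API, and rests on several assumptions: the sentinel value standing in for MAX_FINITE, the helper names, the one-slot-per-rank buffer layout, and the timeout are all hypothetical.

```python
import threading
import time

# Hypothetical stand-in for MAX_FINITE; the real value depends on the buffer dtype.
SENTINEL = 2**64 - 1


def signal_completion(recv_counts, rank, token_count):
    """Model of a sender publishing its token count for one rank slot."""
    recv_counts[rank] = token_count


def wait_for_all(recv_counts, timeout_s=5.0):
    """Poll until no slot holds the sentinel, then return the counts."""
    deadline = time.monotonic() + timeout_s
    while any(count == SENTINEL for count in recv_counts):
        if time.monotonic() > deadline:
            raise TimeoutError("some ranks never signaled completion")
        time.sleep(0.001)
    return list(recv_counts)


def demo(num_ranks=4):
    # Every slot starts at the sentinel, meaning "nothing received yet".
    recv_counts = [SENTINEL] * num_ranks
    senders = [
        threading.Thread(
            target=signal_completion, args=(recv_counts, rank, 10 * (rank + 1))
        )
        for rank in range(num_ranks)
    ]
    for sender in senders:
        sender.start()
    counts = wait_for_all(recv_counts)
    for sender in senders:
        sender.join()
    return counts


print(demo())  # [10, 20, 30, 40]
```

Using a sentinel rather than zero as the "not ready" marker lets a legitimate count of zero tokens be told apart from a slot that has not been written yet.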
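
The fused_silu kernels are described as reading row offsets on the device because the host does not know how many tokens arrived. A CPU reference model in Python can make that concrete. It is a sketch under stated assumptions, not the kernel's implementation: it assumes the last row offset equals the total number of received tokens, that each row stores the gate half followed by the up half, and that the output is silu(gate) * up. The function names are hypothetical.

```python
import math


def silu(x):
    """SiLU(x) = x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def fused_silu_reference(rows, row_offsets):
    """Apply silu(gate) * up to the valid rows only.

    rows: buffer preallocated for the maximum token count.
    row_offsets: cumulative per-expert offsets; the last entry is assumed
                 to be the number of rows that actually hold received tokens.
    """
    num_valid = row_offsets[-1]
    out = []
    for row in rows[:num_valid]:
        half = len(row) // 2
        gate, up = row[:half], row[half:]  # assumed [gate | up] layout
        out.append([silu(g) * u for g, u in zip(gate, up)])
    return out


# Four preallocated rows, but the offsets say only three were received.
rows = [[0.0, 2.0], [1.0, 3.0], [-1.0, 1.0], [9.0, 9.0]]
row_offsets = [0, 1, 3]  # expert 0 owns row 0, expert 1 owns rows 1-2
result = fused_silu_reference(rows, row_offsets)
print(len(result))  # 3
```

Rows past the last offset are never touched, which mirrors why the bound has to come from the row offsets rather than from a host-side size.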