Mojo module
ep_comm
comptime values
BLOCK_SCOPE
comptime BLOCK_SCOPE = _BLOCK_SCOPE()
DEVICE_SCOPE
comptime DEVICE_SCOPE = _DEVICE_SCOPE()
elementwise_epilogue_type
comptime elementwise_epilogue_type = def[dtype: DType, width: Int, *, alignment: Int = 1](IndexList[2], SIMD[dtype, width]) capturing -> None
EP_DATA_READY_FLAG
comptime EP_DATA_READY_FLAG = 1024
input_scales_wrapper_type
comptime input_scales_wrapper_type = def[dtype: DType](Int) capturing -> Scalar[dtype]
MAX_GPUS_PER_NODE
comptime MAX_GPUS_PER_NODE = 8
router_weights_wrapper_type
comptime router_weights_wrapper_type = def[width: Int](token_idx: Int, topk_id: Int) capturing -> SIMD[DType.float32, width]
Structs
- BF16TokenFormat
- BlockwiseFP8TokenFormat
- EPCombineKernel: Implements combine_async and combine_wait kernel logic for Expert Parallelism.
- EPDispatchKernel: Implements dispatch_async and dispatch_wait kernel logic for Expert Parallelism.
- EPLocalSyncCounters: Manages atomic counters for EP kernel synchronization within a device.
- MXFP4TokenFormat
- NVFP4TokenFormat
Traits
Functions
- block_memcpy: Copies a memory area from source to destination using vectorized load and store instructions. The caller must ensure that the pointers are aligned to the SIMD width.
- block_prefix_sum: Performs a prefix sum (scan) operation across all threads in a block.
- combine_async_kernel: Sends tokens to the original rank based on the src_info tensor. This kernel uses the non-blocking SHMEM API and returns immediately after initiating the communication. The communication is considered complete after calling the combine_wait_kernel.
- combine_kernel: Fused combine kernel that combines combine_async and combine_wait functionality in a single kernel launch.
- combine_wait_kernel: This kernel is called after the combine_kernel to complete the communication. It keeps polling the receive count buffer; once the count is no longer MAX_FINITE, it can confirm that the communication from a remote rank is complete.
- dispatch_async_kernel: Dispatches tokens to experts on remote ranks based on the top-k expert IDs. This kernel uses the non-blocking SHMEM API if use_shmem is True, and returns immediately after initiating the communication. The communication is considered complete after calling the dispatch_wait_kernel.
- dispatch_kernel: Fused dispatch kernel that combines dispatch_async and dispatch_wait functionality in a single kernel launch.
- dispatch_wait_kernel: This kernel is called after the dispatch_kernel to complete the communication. It keeps polling the receive count buffer; once the count is no longer MAX_FINITE, it can confirm that the communication from a remote rank is complete.
- ep_signal_completion: Signals the completion of the communication by writing to the receive count buffer. Uses direct memory access if the target device is on the same node, and the SHMEM API if the target device is on a different node.
- fused_silu_fp8_kernel: Performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually because, after the EP dispatch phase, the actual number of received tokens is not known to the host. It reads the row offsets to determine the actual number of received tokens in the input tensor.
- fused_silu_kernel: Performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually because, after the EP dispatch phase, the actual number of received tokens is not known to the host. It reads the row offsets to determine the actual number of received tokens in the input tensor.
- fused_silu_mxfp4_kernel: Performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually because, after the EP dispatch phase, the actual number of received tokens is not known to the host. It reads the row offsets to determine the actual number of received tokens in the input tensor.
- fused_silu_nvfp4_interleaved_kernel: SwiGLU + NVFP4 quantization for interleaved gate/up layout.
- fused_silu_nvfp4_kernel: Performs the SiLU operation for all the MLPs in the EP MoE module. The kernel is implemented manually because, after the EP dispatch phase, the actual number of received tokens is not known to the host. It reads the row offsets to determine the actual number of received tokens in the input tensor.
- get_device_alignment
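
The fused SiLU entries above rely on row offsets to find how many tokens actually arrived after dispatch. The following is a minimal CPU-side sketch of that idea in plain Python, not the module's implementation. It assumes the row offsets are an exclusive prefix sum of per-expert token counts, so that the last offset equals the number of valid rows; the names exclusive_prefix_sum, silu_valid_rows, and tokens_per_expert are hypothetical and do not appear in ep_comm.

```python
import math


def exclusive_prefix_sum(counts):
    """Return len(counts) + 1 offsets; offsets[i] is the first row of expert i.

    Assumption: this mirrors what a block-level scan would produce on device.
    """
    offsets = [0]
    for count in counts:
        offsets.append(offsets[-1] + count)
    return offsets


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_valid_rows(buffer, row_offsets):
    """Apply SiLU in place to the first row_offsets[-1] rows only.

    Rows past that point are treated as unused padding and left untouched.
    Returns the number of rows that were processed.
    """
    num_valid = row_offsets[-1]
    for row in range(num_valid):
        buffer[row] = [silu(value) for value in buffer[row]]
    return num_valid


# Hypothetical example: three local experts received 2, 0, and 1 tokens.
tokens_per_expert = [2, 0, 1]
row_offsets = exclusive_prefix_sum(tokens_per_expert)

# The buffer is sized for the worst case; the last row is padding.
buffer = [[0.0, 1.0], [2.0, -1.0], [0.5, 0.5], [9.0, 9.0]]
num_processed = silu_valid_rows(buffer, row_offsets)
```

Because the valid row count is read from the offsets at run time, the buffer can be allocated for the worst case ahead of time without the host needing to know how many tokens were received.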