
Mojo function

dispatch_wait_kernel

def dispatch_wait_kernel[num_threads: Int, row_offsets_layout: TensorLayout, expert_ids_layout: TensorLayout, src_info_layout: TensorLayout, n_sms: Int, n_experts: Int, n_ranks: Int, max_tokens_per_rank: Int, token_fmt_type: TokenFormat, input_scales_wrapper: Optional[def[dtype: DType](Int) capturing thin -> Scalar[dtype]] = None](format_handler: token_fmt_type, row_offsets: TileTensor[DType.uint32, row_offsets_layout, MutUntrackedOrigin], expert_ids: TileTensor[DType.int32, expert_ids_layout, MutUntrackedOrigin], src_info: TileTensor[DType.int32, src_info_layout, MutUntrackedOrigin], recv_buf_p: Pointer[UInt8, MutUntrackedOrigin, _safe=False], recv_count_p: Pointer[UInt64, MutUntrackedOrigin, _safe=False], ep_counters: EPLocalSyncCounters[n_experts], my_rank: Int32)

This kernel is called after dispatch_kernel to complete the communication. It polls the receive count buffer; once a count is no longer MAX_FINITE, the kernel knows that the transfer from the corresponding remote rank is complete.
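The polling pattern described above can be illustrated with a small host-side model in Python. This is only a conceptual sketch, not the kernel's code: `wait_for_count`, `read_count`, and `max_polls` are hypothetical names, and the sentinel is assumed to be the largest UInt64 value (2**64 - 1), matching the UInt64 element type of the receive count buffer.

```python
UINT64_MAX_FINITE = 2**64 - 1  # assumed sentinel: largest UInt64 value


def wait_for_count(read_count, max_polls=1_000_000):
    """Poll until the slot no longer holds the sentinel, then return the count."""
    for _ in range(max_polls):
        value = read_count()
        if value != UINT64_MAX_FINITE:
            # The remote rank has written its token count: transfer complete.
            return value
    raise TimeoutError("receive count never left the sentinel value")


# Simulated slot: two polls see the sentinel, the third sees 5 arrived tokens.
observations = iter([UINT64_MAX_FINITE, UINT64_MAX_FINITE, 5])
arrived = wait_for_count(lambda: next(observations))
print(arrived)
```

The poll limit exists only so the sketch terminates on its own; whether the real kernel bounds its wait is not stated here.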

The kernel also aggregates the tokens from all experts and stores them in the output tensor using a ragged representation.
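To make the ragged representation concrete, the following Python sketch derives row offsets and expert IDs from per-expert, per-rank receive counts. It is an illustration under stated assumptions, not the kernel's algorithm: `build_ragged_metadata` is a hypothetical helper, and both the choice to skip experts with no tokens and the ordering of experts are assumptions.

```python
from itertools import accumulate


def build_ragged_metadata(recv_counts):
    """recv_counts[e][r] = tokens received for local expert e from rank r."""
    tokens_per_expert = [sum(per_rank) for per_rank in recv_counts]
    # Assumption: only experts that actually received tokens are listed.
    expert_ids = [e for e, n in enumerate(tokens_per_expert) if n > 0]
    # Prefix sum with a leading 0: rows of expert_ids[i] span
    # row_offsets[i] up to (but not including) row_offsets[i + 1].
    row_offsets = [0] + list(accumulate(tokens_per_expert[e] for e in expert_ids))
    return row_offsets, expert_ids


# Three local experts, two ranks; expert 1 receives nothing.
counts = [[2, 1], [0, 0], [3, 4]]
row_offsets, expert_ids = build_ragged_metadata(counts)
print(row_offsets, expert_ids)
```

In this layout, all tokens for one expert occupy a contiguous row range of the output, which is what allows a grouped matmul to process each expert's rows as one group.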

Parameters:

num_threads (Int): The number of threads in the block.
row_offsets_layout (TensorLayout): The layout of the row offsets.
expert_ids_layout (TensorLayout): The layout of the expert IDs.
src_info_layout (TensorLayout): The layout of the source token info.
n_sms (Int): The total number of SMs on the device.
n_experts (Int): The number of experts on the device.
n_ranks (Int): The number of ranks.
max_tokens_per_rank (Int): The maximum number of tokens per rank.
token_fmt_type (TokenFormat): Type conforming to TokenFormat trait that defines the token encoding scheme.
input_scales_wrapper (Optional[def[dtype: DType](Int) capturing thin -> Scalar[dtype]]): The wrapper for the input scales.

Args:

format_handler (token_fmt_type): Instance of token_fmt_type that performs token decoding and manages output tensor writes.
row_offsets (TileTensor[DType.uint32, row_offsets_layout, MutUntrackedOrigin]): The row offsets to be updated. Will be consumed by the grouped_matmul kernel.
expert_ids (TileTensor[DType.int32, expert_ids_layout, MutUntrackedOrigin]): The expert IDs to be updated. Will be consumed by the grouped_matmul kernel.
src_info (TileTensor[DType.int32, src_info_layout, MutUntrackedOrigin]): The source token info to be updated. Once the expert computation is complete, tokens will be sent back to their original rank using the information in this tensor.
recv_buf_p (Pointer[UInt8, MutUntrackedOrigin, _safe=False]): The pointer to the receive buffer. Must be allocated using shmem_alloc if use_shmem is True. The underlying buffer has shape (n_local_experts, n_ranks, max_tokens_per_rank, msg_bytes).
recv_count_p (Pointer[UInt64, MutUntrackedOrigin, _safe=False]): The pointer to the receive count buffer. Must be allocated using shmem_alloc if use_shmem is True. The underlying buffer has shape (n_local_experts, n_ranks).
ep_counters (EPLocalSyncCounters[n_experts]): EP atomic counters for kernel synchronization.
my_rank (Int32): The rank of the current device.
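The buffer shapes given for recv_buf_p and recv_count_p imply a simple index calculation, sketched below in Python. The sketch assumes a contiguous row-major layout, which the shapes suggest but this page does not state explicitly. `recv_buf_offset` and `recv_count_index` are hypothetical helper names, and the sizes in the usage lines are made-up example values.

```python
def recv_buf_offset(expert, rank, token, n_ranks, max_tokens_per_rank, msg_bytes):
    """Byte offset of one message in a buffer of shape
    (n_local_experts, n_ranks, max_tokens_per_rank, msg_bytes), row-major."""
    slot = (expert * n_ranks + rank) * max_tokens_per_rank + token
    return slot * msg_bytes


def recv_count_index(expert, rank, n_ranks):
    """Element index into a count buffer of shape (n_local_experts, n_ranks)."""
    return expert * n_ranks + rank


# Example values (assumed): 4 ranks, 128 tokens per rank, 64-byte messages.
offset = recv_buf_offset(expert=1, rank=2, token=3,
                         n_ranks=4, max_tokens_per_rank=128, msg_bytes=64)
count_idx = recv_count_index(expert=1, rank=2, n_ranks=4)
print(offset, count_idx)
```

Each (expert, rank) pair thus owns one count slot and one fixed-size region of max_tokens_per_rank messages, so a sender can write its tokens without coordinating with other ranks.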