
Mojo function

combine_wait_kernel

def combine_wait_kernel[output_type: DType, num_threads: Int, output_tokens_layout: TensorLayout, n_sms: Int, top_k: Int, n_experts: Int, n_ranks: Int, msg_bytes: Int, max_tokens_per_rank: Int, router_weights_wrapper: Optional[def[width: Int](token_idx: Int, topk_id: Int) capturing thin -> SIMD[DType.float32, width]] = None, elementwise_lambda_fn: Optional[def[dtype: DType, width: SIMDLength, *, alignment: Int = Int(1)](IndexList[Int(2)], SIMD[dtype, width]) capturing thin -> None] = None](output_tokens: TileTensor[output_type, output_tokens_layout, MutUntrackedOrigin], recv_buf_p: Pointer[UInt8, MutUntrackedOrigin, _safe=False], recv_count_p: Pointer[UInt64, MutUntrackedOrigin, _safe=False], ep_counters: EPLocalSyncCounters[n_experts], my_rank: Int32)

This kernel runs after combine_kernel to complete the communication. It polls the receive count buffer until each count is no longer MAX_FINITE. Once a count changes from that sentinel value, the kernel knows that the transfer from the corresponding remote rank is complete.
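The polling behavior described above can be modeled on the host as a minimal sketch. This is plain Python, not the kernel itself: the names `SENTINEL`, `wait_for_counts`, and `simulate` are hypothetical, threads stand in for remote ranks, and the sentinel is assumed to be the largest UInt64 value, which may not match the MAX_FINITE constant the kernel actually uses.

```python
import threading
import time

# Assumption: MAX_FINITE for a UInt64 counter is modeled as 2**64 - 1.
SENTINEL = 2**64 - 1


def wait_for_counts(recv_counts, poll_interval=0.0005, timeout=2.0):
    """Poll until no slot holds the sentinel, then return a snapshot."""
    deadline = time.monotonic() + timeout
    while True:
        snapshot = list(recv_counts)
        # A slot that differs from the sentinel has been written by its sender.
        if all(count != SENTINEL for count in snapshot):
            return snapshot
        if time.monotonic() > deadline:
            raise TimeoutError("some ranks never signalled completion")
        time.sleep(poll_interval)


def simulate(n_ranks=4):
    """Hypothetical driver: each thread plays a remote rank writing its count."""
    recv_counts = [SENTINEL] * n_ranks

    def remote_rank(rank):
        time.sleep(0.002 * (rank + 1))  # stagger the arrival times
        recv_counts[rank] = rank + 10   # overwrite the sentinel with a real count

    workers = [
        threading.Thread(target=remote_rank, args=(rank,))
        for rank in range(n_ranks)
    ]
    for worker in workers:
        worker.start()
    counts = wait_for_counts(recv_counts)
    for worker in workers:
        worker.join()
    return counts
```

The sentinel lets a single memory location carry both "not yet arrived" and the final count, so the waiter needs no separate flag. The timeout exists only to keep the host-side model from hanging; whether the real kernel has one is not stated here.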

Parameters:

output_type (DType): The type of the output tokens.
num_threads (Int): The number of threads in the block.
output_tokens_layout (TensorLayout): The layout of the output tokens.
n_sms (Int): The total number of SMs on the device.
top_k (Int): The number of selected experts per token.
n_experts (Int): The number of experts on the device.
n_ranks (Int): The number of ranks.
msg_bytes (Int): The number of bytes in the message for each token.
max_tokens_per_rank (Int): The maximum number of tokens per rank.
router_weights_wrapper (Optional[def[width: Int](token_idx: Int, topk_id: Int) capturing thin -> SIMD[DType.float32, width]]): The wrapper for the router weights. If provided, all routed experts' outputs for a token will be weighted and summed.
elementwise_lambda_fn (Optional[def[dtype: DType, width: SIMDLength, *, alignment: Int = Int(1)](IndexList[Int(2)], SIMD[dtype, width]) capturing thin -> None]): Optional output lambda function.
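As an illustration of the weighted reduction that router_weights_wrapper enables, the following sketch combines the top_k expert outputs for one token. It is a plain-Python model under assumed names (`combine_token`, `expert_outputs`, `router_weights` are hypothetical); the kernel itself operates on SIMD vectors and may order or accumulate the sum differently.

```python
def combine_token(expert_outputs, router_weights):
    """Return the weighted sum of the top_k expert output vectors for one token.

    expert_outputs: list of top_k vectors, each of the same hidden size.
    router_weights: list of top_k scalar weights, one per routed expert.
    """
    hidden = len(expert_outputs[0])
    combined = [0.0] * hidden
    for output, weight in zip(expert_outputs, router_weights):
        for i in range(hidden):
            combined[i] += weight * output[i]
    return combined
```

For example, with two routed experts producing `[1.0, 2.0]` and `[3.0, 4.0]` and weights `0.25` and `0.75`, the combined token is `[2.5, 3.5]`.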

Args:

output_tokens (TileTensor[output_type, output_tokens_layout, MutUntrackedOrigin]): The tensor to store the output tokens.
recv_buf_p (Pointer[UInt8, MutUntrackedOrigin, _safe=False]): The pointer to the receive buffer. Must be allocated using shmem_alloc. The underlying buffer has shape (max_tokens_per_rank, top_k, msg_bytes).
recv_count_p (Pointer[UInt64, MutUntrackedOrigin, _safe=False]): The pointer to the receive count buffer. Must be allocated using shmem_alloc if use_shmem is True. The underlying buffer has shape (n_local_experts, n_ranks).
ep_counters (EPLocalSyncCounters[n_experts]): EP atomic counters for kernel synchronization.
my_rank (Int32): The rank of the current device.
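To make the receive buffer shape (max_tokens_per_rank, top_k, msg_bytes) concrete, the sketch below computes where one token's message from one routed expert would start. It assumes a contiguous row-major layout, which the shape suggests but this page does not confirm; `message_offset` and `message_slice` are hypothetical helper names.

```python
def message_offset(token_idx, topk_id, top_k, msg_bytes):
    """Byte offset of the message for (token_idx, topk_id).

    Assumption: the buffer is contiguous and row-major over
    (max_tokens_per_rank, top_k, msg_bytes).
    """
    return (token_idx * top_k + topk_id) * msg_bytes


def message_slice(recv_buf, token_idx, topk_id, top_k, msg_bytes):
    """Return the msg_bytes-long slice of recv_buf holding one message."""
    start = message_offset(token_idx, topk_id, top_k, msg_bytes)
    return recv_buf[start:start + msg_bytes]
```

With `top_k = 4` and `msg_bytes = 16`, the message for token 2 and expert slot 1 starts at byte `(2 * 4 + 1) * 16 = 144`.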