Mojo struct

Struct_ep_combine_wait

struct Struct_ep_combine_wait

Registers the ep.combine_wait graph op with the graph compiler.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[combine_dtype: DType, router_weights_dtype: DType, //, hidden_size: Int, top_k: Int, n_experts: Int, max_token_per_rank: Int, n_gpus_per_node: Int, n_nodes: Int, has_epilogue_fusion: Bool, target: StringSlice[ImmStaticOrigin]](output_tokens: ManagedTensorSlice[IOSpec[_, _].FusedOutput, static_spec=output_tokens.static_spec], atomic_counters: ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec], recv_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec], recv_count_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec], router_weights: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=router_weights.static_spec], context: DeviceContext)

Executes the expert parallelism (EP) combine completion kernel.

Waits for incoming expert outputs and computes the weighted sum of routed expert outputs for each token using the router weights.
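
The weighted sum described above can be written out as a plain Python reference model. This is only a sketch of the arithmetic, not the Mojo kernel: it assumes the expert outputs for each token have already arrived and are laid out as `expert_outputs[token][k]` (a hypothetical layout chosen for illustration), and it ignores the wait, the receive buffers, and any epilogue. The name `combine_reference` is likewise hypothetical.

```python
def combine_reference(expert_outputs, router_weights):
    """Weighted sum of routed expert outputs, one row per token.

    expert_outputs: [num_tokens][top_k][hidden_size] (hypothetical layout)
    router_weights: [num_tokens][top_k]
    returns:        [num_tokens][hidden_size]
    """
    output_tokens = []
    for token_outputs, token_weights in zip(expert_outputs, router_weights):
        hidden_size = len(token_outputs[0])
        combined = [0.0] * hidden_size
        # Accumulate weight * expert_output over the top_k routed experts.
        for expert_output, weight in zip(token_outputs, token_weights):
            for i in range(hidden_size):
                combined[i] += weight * expert_output[i]
        output_tokens.append(combined)
    return output_tokens


# One token, top_k = 2, hidden_size = 2.
example = combine_reference(
    expert_outputs=[[[1.0, 2.0], [3.0, 4.0]]],
    router_weights=[[0.25, 0.75]],
)
```

Here `example` holds a single row of length `hidden_size`, matching the documented `[num_tokens, hidden_size]` shape of `output_tokens`.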

Parameters:

combine_dtype (DType): DType used for the token payload during the combine phase (inferred).
router_weights_dtype (DType): DType of the router weights tensor used to compute the weighted sum (inferred).
hidden_size (Int): Size of the model's hidden dimension.
top_k (Int): Number of experts each token is routed to.
n_experts (Int): Total number of experts across all GPUs.
max_token_per_rank (Int): Maximum number of tokens any GPU can receive.
n_gpus_per_node (Int): Number of GPUs per physical node.
n_nodes (Int): Number of physical nodes in the deployment.
has_epilogue_fusion (Bool): Whether to apply an elementwise epilogue function after computing the combined output.
target (StringSlice[ImmStaticOrigin]): Compile-time device target.
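
To see how the topology parameters relate, the sketch below derives the total number of ranks and the experts hosted per rank. It rests on two assumptions that this page does not state: each GPU is one rank, and experts are divided evenly across ranks. The helper name `ep_topology` is hypothetical.

```python
def ep_topology(n_experts, n_gpus_per_node, n_nodes):
    """Return (n_ranks, experts_per_rank) for an even expert split.

    Assumes one rank per GPU and an even distribution of experts;
    neither is confirmed by the reference text above.
    """
    n_ranks = n_gpus_per_node * n_nodes
    if n_ranks <= 0:
        raise ValueError("n_gpus_per_node and n_nodes must be positive")
    if n_experts % n_ranks != 0:
        raise ValueError("n_experts is not divisible by the number of ranks")
    return n_ranks, n_experts // n_ranks


# 64 experts on 2 nodes with 8 GPUs each.
n_ranks, experts_per_rank = ep_topology(n_experts=64, n_gpus_per_node=8, n_nodes=2)
```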

Args:

output_tokens (ManagedTensorSlice[IOSpec[_, _].FusedOutput, static_spec=output_tokens.static_spec]): Fused output tensor storing the weighted sum of routed expert outputs for each token. Shape [num_tokens, hidden_size].
atomic_counters (ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec]): Atomic counters coordinating work across thread blocks during the combine phase.
recv_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec]): Receive buffer pointers for the combine phase.
recv_count_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec]): Receive count buffer pointers tracking the number of tokens received per expert.
router_weights (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=router_weights.static_spec]): Router weights for the current device used to compute the weighted sum of expert outputs. Shape [num_tokens, top_k].
context (DeviceContext): GPU device context for the current device.
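
The two documented shapes share the same `num_tokens` dimension. A minimal consistency check, written in plain Python over shape tuples rather than real tensors, might look like the following. The function name `check_combine_shapes` is hypothetical and is not part of the MAX API.

```python
def check_combine_shapes(output_tokens_shape, router_weights_shape, hidden_size, top_k):
    """Validate the documented shapes and return num_tokens.

    output_tokens:  [num_tokens, hidden_size]
    router_weights: [num_tokens, top_k]
    """
    if len(output_tokens_shape) != 2 or output_tokens_shape[1] != hidden_size:
        raise ValueError("output_tokens must have shape [num_tokens, hidden_size]")
    num_tokens = output_tokens_shape[0]
    if tuple(router_weights_shape) != (num_tokens, top_k):
        raise ValueError("router_weights must have shape [num_tokens, top_k]")
    return num_tokens


num_tokens = check_combine_shapes((4, 8), (4, 2), hidden_size=8, top_k=2)
```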
