
Mojo struct

Struct_ep_combine

struct Struct_ep_combine

Registers the ep.combine graph op with the graph compiler.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[
    combine_dtype: DType,
    router_weights_dtype: DType,
    hidden_size: Int,
    top_k: Int,
    n_experts: Int,
    max_token_per_rank: Int,
    n_gpus_per_node: Int,
    n_nodes: Int,
    fused_shared_expert: Bool,
    has_epilogue_fusion: Bool,
    skip_a2a: Bool,
    //,
    target: StringSlice[ImmStaticOrigin]
](
    output_tokens: ManagedTensorSlice[IOSpec[_, _].FusedOutput, static_spec=output_tokens.static_spec],
    atomic_counters: ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec],
    input_tokens: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec],
    src_info: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=src_info.static_spec],
    send_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec],
    recv_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec],
    recv_count_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec],
    router_weights: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=router_weights.static_spec],
    context: DeviceContext
)

Execute the fused Expert Parallelism combine kernel.

Sends expert outputs back to their original devices, waits for all transfers to complete, and computes the weighted sum of routed expert outputs for each token.
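The weighted sum described above can be illustrated with a small host-side reference in plain Python. This is only a sketch of the arithmetic, not the GPU kernel: the function name `combine_reference` and the nested-list layout (`expert_outputs[t][k]` holding the output of token `t`'s `k`-th routed expert) are assumptions made for illustration.

```python
def combine_reference(expert_outputs, router_weights):
    """Weighted sum of routed expert outputs, one result row per token.

    Assumed layout (hypothetical, for illustration only):
      expert_outputs[t][k] -> list of hidden_size floats
      router_weights[t][k] -> float weight for the same (token, slot)
    """
    combined = []
    for outputs, weights in zip(expert_outputs, router_weights):
        hidden_size = len(outputs[0])
        acc = [0.0] * hidden_size
        # Accumulate weight * expert_output over the top_k routed experts.
        for vec, weight in zip(outputs, weights):
            for i, value in enumerate(vec):
                acc[i] += weight * value
        combined.append(acc)
    return combined


# Two tokens, top_k = 2, hidden_size = 2.
expert_outputs = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [2.0, 0.0]],
]
router_weights = [[0.5, 0.5], [0.25, 0.75]]
result = combine_reference(expert_outputs, router_weights)
```

A reference like this may be useful for checking kernel output on small inputs, since it makes the per-token reduction over `top_k` explicit.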

Parameters:

combine_dtype (DType): DType used for the token payload during the combine phase (inferred).
router_weights_dtype (DType): DType of the router weights tensor used to compute the weighted sum (inferred).
hidden_size (Int): Size of the model's hidden dimension (inferred).
top_k (Int): Number of experts each token is routed to (inferred).
n_experts (Int): Total number of experts across all GPUs (inferred).
max_token_per_rank (Int): Maximum number of tokens any GPU can receive (inferred).
n_gpus_per_node (Int): Number of GPUs per physical node (inferred).
n_nodes (Int): Number of physical nodes in the deployment (inferred).
fused_shared_expert (Bool): Whether a shared expert is fused into the combine kernel, adding its output to the routed expert outputs (inferred).
has_epilogue_fusion (Bool): Whether to apply an elementwise epilogue function after computing the combined output (inferred).
skip_a2a (Bool): Whether to skip the all-to-all communication and send tokens only within the current device (inferred).
target (StringSlice[ImmStaticOrigin]): Compile-time device target.
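As a rough illustration of how the topology parameters relate, the sketch below derives the total GPU count from `n_gpus_per_node` and `n_nodes`. The helper name `describe_topology` is hypothetical, and the even split of `n_experts` across GPUs is an assumption that this page does not state.

```python
def describe_topology(n_experts, n_gpus_per_node, n_nodes):
    """Derive deployment-wide quantities from the compile-time parameters."""
    # Total number of GPUs (ranks) in the deployment.
    n_ranks = n_gpus_per_node * n_nodes
    # ASSUMPTION: experts are split evenly across ranks; not stated in the docs.
    if n_experts % n_ranks != 0:
        raise ValueError("n_experts is not divisible by the number of ranks")
    return {"n_ranks": n_ranks, "experts_per_rank": n_experts // n_ranks}


topology = describe_topology(n_experts=64, n_gpus_per_node=8, n_nodes=2)
```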

Args:

output_tokens (ManagedTensorSlice[IOSpec[_, _].FusedOutput, static_spec=output_tokens.static_spec]): Fused output tensor storing the weighted sum of routed expert outputs for each token. Shape [num_tokens, hidden_size].
atomic_counters (ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec]): Atomic counters coordinating work across thread blocks during the combine phase.
input_tokens (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec]): Expert output tokens to send back to their original devices. Shape [num_tokens, hidden_size].
src_info (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=src_info.static_spec]): Source routing information from the dispatch phase recording the originating rank and token index for each token. Shape [num_tokens, 2].
send_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec]): Send buffer pointers for the combine phase, one per local GPU.
recv_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec]): Receive buffer pointers for the combine phase, one per local GPU.
recv_count_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec]): Receive count buffer pointers tracking the number of tokens received per expert.
router_weights (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=router_weights.static_spec]): Router weights for the current device used to compute the weighted sum of expert outputs. Shape [num_tokens, top_k].
context (DeviceContext): GPU device context for the current device.
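To make the role of `src_info` concrete, the following plain-Python sketch groups expert output rows by the originating rank and token index recorded for each row. It only models the routing idea: the helper name `send_back` is hypothetical, and the real receive-buffer layout behind `recv_ptrs` is not documented here, so rows are simply appended in arrival order.

```python
from collections import defaultdict


def send_back(input_tokens, src_info):
    """Group expert output rows by (origin rank, token index).

    input_tokens[i] is one expert output row; src_info[i] is the pair
    (origin_rank, token_index) recorded for that row during dispatch.
    """
    # received[rank][token_index] -> list of rows returned for that token.
    # ASSUMPTION: arrival-order lists stand in for the real receive buffers.
    received = defaultdict(lambda: defaultdict(list))
    for row, (origin_rank, token_index) in zip(input_tokens, src_info):
        received[origin_rank][token_index].append(row)
    return received


input_tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
src_info = [(0, 0), (1, 0), (0, 0)]
received = send_back(input_tokens, src_info)
```

Here rank 0 gets two rows back for its token 0 and rank 1 gets one row for its token 0, which is the per-token set that the weighted sum would then reduce.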
