
Mojo struct

Struct_ep_combine_skip_a2a

struct Struct_ep_combine_skip_a2a

Registers the ep.combine.skip_a2a graph op with the graph compiler.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[
    combine_dtype: DType,
    router_weights_dtype: DType,
    hidden_size: Int,
    top_k: Int,
    n_experts: Int,
    max_token_per_rank: Int,
    n_gpus_per_node: Int,
    n_nodes: Int,
    fused_shared_expert: Bool,
    has_epilogue_fusion: Bool,
    skip_a2a: Bool,
    allreduce_world_size: Int,
    //,
    target: StringSlice[ImmStaticOrigin]
](
    output_tokens: ManagedTensorSlice[IOSpec[_, _].FusedOutput, static_spec=output_tokens.static_spec],
    atomic_counters: ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec],
    input_tokens: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec],
    src_info: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=src_info.static_spec],
    send_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec],
    recv_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec],
    recv_count_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec],
    router_weights: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=router_weights.static_spec],
    topk_ids: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=topk_ids.static_spec],
    context: DeviceContext
)

Execute the fused Expert Parallelism combine kernel.

Sends expert outputs back to their original devices, waits for all transfers to complete, and computes the weighted sum of routed expert outputs for each token. When skip_a2a is set, the kernel instead skips the all-to-all communication and reduces expert outputs locally using an allreduce across allreduce_world_size ranks.
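
The weighted-sum step described above can be illustrated with a small host-side reference in plain Python. This is only a sketch of the arithmetic, not the GPU kernel: the function name `combine_reference` and the `expert_outputs[token][slot]` layout are assumptions made for illustration, and the all-to-all transfer is treated as already complete.

```python
def combine_reference(expert_outputs, router_weights):
    """Weighted sum of routed expert outputs for each token.

    expert_outputs: [num_tokens][top_k][hidden_size] (hypothetical layout)
    router_weights: [num_tokens][top_k]
    returns:        [num_tokens][hidden_size]
    """
    output_tokens = []
    for token_outputs, token_weights in zip(expert_outputs, router_weights):
        hidden_size = len(token_outputs[0])
        combined = [0.0] * hidden_size
        # Accumulate each routed expert's output, scaled by its router weight.
        for expert_out, weight in zip(token_outputs, token_weights):
            for h in range(hidden_size):
                combined[h] += weight * expert_out[h]
        output_tokens.append(combined)
    return output_tokens


# One token, top_k = 2, hidden_size = 2.
expert_outputs = [[[1.0, 2.0], [3.0, 4.0]]]
router_weights = [[0.25, 0.75]]
result = combine_reference(expert_outputs, router_weights)
print(result)  # [[2.5, 3.5]]
```

The result has shape [num_tokens, hidden_size], matching the documented shape of output_tokens.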

Parameters:

combine_dtype (DType): DType used for the token payload during the combine phase (inferred).
router_weights_dtype (DType): DType of the router weights tensor used to compute the weighted sum (inferred).
hidden_size (Int): Size of the model's hidden dimension (inferred).
top_k (Int): Number of experts each token is routed to (inferred).
n_experts (Int): Total number of experts across all GPUs (inferred).
max_token_per_rank (Int): Maximum number of tokens any GPU can receive (inferred).
n_gpus_per_node (Int): Number of GPUs per physical node (inferred).
n_nodes (Int): Number of physical nodes in the deployment (inferred).
fused_shared_expert (Bool): Whether a shared expert is fused into the combine kernel, adding its output to the routed expert outputs (inferred).
has_epilogue_fusion (Bool): Whether to apply an elementwise epilogue function after computing the combined output (inferred).
skip_a2a (Bool): Whether to skip the all-to-all communication and reduce expert outputs locally instead (inferred).
allreduce_world_size (Int): World size for the local allreduce used when skip_a2a is set (inferred).
target (StringSlice[ImmStaticOrigin]): Compile-time device target.
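
As a rough illustration of how the topology parameters relate to one another, the Python sketch below derives the total rank count and the number of experts hosted per rank. The helper `ep_layout` is hypothetical, and the even split of experts across ranks is an assumption that this page does not state.

```python
def ep_layout(n_experts, n_gpus_per_node, n_nodes):
    """Return (n_ranks, experts_per_rank) for an evenly partitioned deployment.

    Assumes one rank per GPU and an even expert split (both assumptions).
    """
    n_ranks = n_gpus_per_node * n_nodes
    if n_experts % n_ranks != 0:
        raise ValueError("n_experts must be divisible by the number of ranks")
    return n_ranks, n_experts // n_ranks


n_ranks, experts_per_rank = ep_layout(n_experts=64, n_gpus_per_node=8, n_nodes=2)
print(n_ranks, experts_per_rank)  # 16 4
```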

Args:

output_tokens (ManagedTensorSlice[IOSpec[_, _].FusedOutput, static_spec=output_tokens.static_spec]): Fused output tensor storing the weighted sum of routed expert outputs for each token. Shape [num_tokens, hidden_size].
atomic_counters (ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec]): Atomic counters coordinating work across thread blocks during the combine phase.
input_tokens (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec]): Expert output tokens to send back to their original devices. Shape [num_tokens, hidden_size].
src_info (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=src_info.static_spec]): Source routing information from the dispatch phase recording the originating rank and token index for each token. Shape [num_tokens, 2].
send_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec]): Send buffer pointers for the combine phase, one per local GPU.
recv_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec]): Receive buffer pointers for the combine phase, one per local GPU.
recv_count_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec]): Receive count buffer pointers tracking the number of tokens received per expert.
router_weights (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=router_weights.static_spec]): Router weights for the current device used to compute the weighted sum of expert outputs. Shape [num_tokens, top_k].
topk_ids (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=topk_ids.static_spec]): Top-k expert IDs selected for each token. Shape [num_tokens, top_k].
context (DeviceContext): GPU device context for the current device.
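
To make the role of src_info more concrete, the following Python sketch groups expert output rows by the rank they originated from, which is the bookkeeping needed before sending them back. The function `group_by_source` is hypothetical, and the column order (rank first, then token index) is an assumption; this page only states that both values are recorded per token.

```python
def group_by_source(input_tokens, src_info):
    """Group expert output rows by originating rank.

    input_tokens: [num_tokens][hidden_size]
    src_info:     [num_tokens][2], assumed to hold (src_rank, token_index)
    returns:      {src_rank: [(token_index, row), ...]}
    """
    per_rank = {}
    for row, (src_rank, token_index) in zip(input_tokens, src_info):
        per_rank.setdefault(src_rank, []).append((token_index, row))
    return per_rank


input_tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
src_info = [[0, 5], [1, 2], [0, 7]]
grouped = group_by_source(input_tokens, src_info)
print(grouped[0])  # [(5, [1.0, 1.0]), (7, [3.0, 3.0])]
```

Each receiving rank can then place a row at its recorded token index before the weighted sum is computed.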
