
Mojo struct

DistributedEPCombine

struct DistributedEPCombine

Registers the mo.distributed.ep.combine graph op with the graph compiler.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[combine_dtype: DType, router_weights_dtype: DType, hidden_size: Int, top_k: Int, n_experts: Int, max_token_per_rank: Int, n_gpus_per_node: Int, n_nodes: Int, fused_shared_expert: Bool, has_epilogue_fusion: Bool, //, target: StringSlice[ImmStaticOrigin], _trace_name: StringSlice[ImmStaticOrigin]](output_tokens: _FusedOutputVariadicTensors[static_specs=output_tokens.static_specs], input_tokens: VariadicTensors[IOSpec[_, _].Input, static_specs=input_tokens.static_specs], src_info: VariadicTensors[IOSpec[_, _].Input, static_specs=src_info.static_specs], send_ptrs: VariadicTensors[IOSpec[_, _].Input, static_specs=send_ptrs.static_specs], recv_ptrs: VariadicTensors[IOSpec[_, _].Input, static_specs=recv_ptrs.static_specs], recv_count_ptrs: VariadicTensors[IOSpec[_, _].Input, static_specs=recv_count_ptrs.static_specs], router_weights: VariadicTensors[IOSpec[_, _].Input, static_specs=router_weights.static_specs], atomic_counters: VariadicTensors[IOSpec[_, _].MutableInput, static_specs=atomic_counters.static_specs], dev_ctxs: DeviceContextArray)

Multi-device fused Expert Parallelism combine with output fusion.

Parameters:

combine_dtype (DType): DType of the combined expert output tokens.
router_weights_dtype (DType): DType of the router weights.
hidden_size (Int): Size of the model's hidden dimension.
top_k (Int): Number of experts each token is routed to.
n_experts (Int): Total number of experts across all GPUs.
max_token_per_rank (Int): Maximum number of tokens per GPU.
n_gpus_per_node (Int): Number of GPUs per node.
n_nodes (Int): Number of physical nodes.
fused_shared_expert (Bool): Whether a shared expert is fused into the combine kernel.
has_epilogue_fusion (Bool): Whether the combine output is fused with a downstream elementwise epilogue.
target (StringSlice[ImmStaticOrigin]): Compile-time device target.
_trace_name (StringSlice[ImmStaticOrigin]): Trace label for this op.
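The topology parameters above imply a few derived quantities that are useful when reasoning about buffer sizes. The following is a minimal Python sketch, not part of the Mojo API: `EPTopology` and its property names are hypothetical, and the even split of experts across ranks is an assumption that this page does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EPTopology:
    """Hypothetical helper mirroring some of the compile-time parameters."""

    n_experts: int
    n_gpus_per_node: int
    n_nodes: int
    top_k: int
    max_token_per_rank: int

    @property
    def n_ranks(self) -> int:
        # One rank per GPU, so the rank count is GPUs per node times nodes.
        return self.n_gpus_per_node * self.n_nodes

    @property
    def experts_per_rank(self) -> int:
        # Assumption: experts are distributed evenly across all ranks.
        if self.n_experts % self.n_ranks != 0:
            raise ValueError("n_experts must be divisible by the number of ranks")
        return self.n_experts // self.n_ranks

    @property
    def max_routed_copies_per_rank(self) -> int:
        # Each token is routed to top_k experts, so one rank contributes
        # at most max_token_per_rank * top_k routed token copies.
        return self.max_token_per_rank * self.top_k


topo = EPTopology(
    n_experts=64, n_gpus_per_node=8, n_nodes=2, top_k=4, max_token_per_rank=128
)
```

With these illustrative values, `topo.n_ranks` is 16, `topo.experts_per_rank` is 4, and `topo.max_routed_copies_per_rank` is 512.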

Args:

output_tokens (_FusedOutputVariadicTensors[static_specs=output_tokens.static_specs]): Fused output variadic tensors storing the combined expert outputs, one per device.
input_tokens (VariadicTensors[IOSpec[_, _].Input, static_specs=input_tokens.static_specs]): Input variadic tensors of expert-processed tokens to combine, one per device.
src_info (VariadicTensors[IOSpec[_, _].Input, static_specs=src_info.static_specs]): Source info tensors recording the originating rank and token index for each received token, one per device.
send_ptrs (VariadicTensors[IOSpec[_, _].Input, static_specs=send_ptrs.static_specs]): Send buffer pointers for the combine phase, one per device.
recv_ptrs (VariadicTensors[IOSpec[_, _].Input, static_specs=recv_ptrs.static_specs]): Receive buffer pointers for the combine phase, one per device.
recv_count_ptrs (VariadicTensors[IOSpec[_, _].Input, static_specs=recv_count_ptrs.static_specs]): Receive count buffer pointers tracking tokens received per expert, one per device.
router_weights (VariadicTensors[IOSpec[_, _].Input, static_specs=router_weights.static_specs]): Router weights used to scale each expert's contribution, one per device.
atomic_counters (VariadicTensors[IOSpec[_, _].MutableInput, static_specs=atomic_counters.static_specs]): Atomic counters coordinating work across thread blocks, one per device.
dev_ctxs (DeviceContextArray): List of GPU device contexts, one per device.
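To make the role of `router_weights` concrete, here is a single-process Python sketch of the arithmetic a combine step performs: each token's output is the sum of its `top_k` expert outputs, each scaled by its router weight. This is an illustration under stated assumptions, not the kernel's implementation. The function name `combine_reference` and the nested-list layout are hypothetical, and the sketch ignores cross-device transfer (`send_ptrs`, `recv_ptrs`, `recv_count_ptrs`), the fused shared expert, and epilogue fusion.

```python
def combine_reference(expert_outputs, router_weights):
    """Weighted sum of per-expert outputs for each token.

    expert_outputs[t][k] is a list of hidden_size floats: the output of the
    k-th selected expert for token t (assumed layout, for illustration only).
    router_weights[t][k] is the scalar weight for that expert.
    """
    combined = []
    for token_outputs, token_weights in zip(expert_outputs, router_weights):
        hidden_size = len(token_outputs[0])
        acc = [0.0] * hidden_size
        # Accumulate each selected expert's contribution, scaled by its weight.
        for vec, weight in zip(token_outputs, token_weights):
            for i in range(hidden_size):
                acc[i] += weight * vec[i]
        combined.append(acc)
    return combined


# One token, top_k = 2, hidden_size = 3.
outputs = [[[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]]
weights = [[0.25, 0.75]]
result = combine_reference(outputs, weights)
# result == [[7.75, 15.5, 23.25]]
```

A reference like this can serve as a correctness oracle when comparing against the output tensors of the real op on small inputs, subject to floating-point tolerance for the chosen `combine_dtype`.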
