Mojo struct

Struct_ep_dispatch

struct Struct_ep_dispatch

Registers the ep.dispatch graph op with the graph compiler.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[dispatch_dtype: DType, hidden_size: Int, top_k: Int, n_experts: Int, max_token_per_rank: Int, n_gpus_per_node: Int, n_nodes: Int, fused_shared_expert: Bool, skip_a2a: Bool, allreduce_world_size: Int, //, target: StringSlice[ImmStaticOrigin]](output_tokens: ManagedTensorSlice[IOSpec[_, _].Output, static_spec=output_tokens.static_spec], row_offsets: ManagedTensorSlice[IOSpec[_, _].Output, static_spec=row_offsets.static_spec], expert_ids: ManagedTensorSlice[IOSpec[_, _].Output, static_spec=expert_ids.static_spec], src_info: ManagedTensorSlice[IOSpec[_, _].Output, static_spec=src_info.static_spec], atomic_counters: ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec], input_tokens: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec], topk_ids: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=topk_ids.static_spec], send_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec], recv_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec], recv_count_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec], context: DeviceContext)

Executes the fused expert-parallelism (EP) dispatch kernel.

Routes tokens to experts based on top-k IDs, sends them to peer devices in BF16 format, waits for incoming tokens, and writes them to the output buffer along with their routing metadata.
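
The routing step described above can be modeled on the CPU. The following is a minimal sketch in plain Python rather than Mojo, and it is not the kernel's implementation. It assumes experts are split into equal contiguous blocks with one block per rank, which this page does not state. `simulate_dispatch` and its tuple layout are hypothetical names introduced for illustration.

```python
# Single-process reference model of expert-parallel dispatch (illustrative only).
# Assumption: experts are split into equal contiguous blocks, one block per rank.

def simulate_dispatch(tokens_per_rank, topk_ids_per_rank, n_experts):
    """Return, for each destination rank, the list of token copies it receives.

    tokens_per_rank[r][t]   -> hidden vector (list of floats) of token t on rank r
    topk_ids_per_rank[r][t] -> list of top_k global expert IDs for that token
    """
    n_ranks = len(tokens_per_rank)
    experts_per_rank = n_experts // n_ranks
    received = [[] for _ in range(n_ranks)]
    for src_rank, (tokens, topk_ids) in enumerate(
        zip(tokens_per_rank, topk_ids_per_rank)
    ):
        for token_idx, (token, expert_ids) in enumerate(zip(tokens, topk_ids)):
            # Each token is copied once per selected expert.
            for expert_id in expert_ids:
                dst_rank = expert_id // experts_per_rank
                received[dst_rank].append((expert_id, src_rank, token_idx, token))
    return received


# Two ranks, four experts (experts 0-1 on rank 0, experts 2-3 on rank 1), top_k = 2.
tokens_per_rank = [
    [[1.0, 1.0], [2.0, 2.0]],  # rank 0 holds two tokens
    [[3.0, 3.0]],              # rank 1 holds one token
]
topk_ids_per_rank = [
    [[0, 2], [3, 1]],
    [[1, 2]],
]
received = simulate_dispatch(tokens_per_rank, topk_ids_per_rank, n_experts=4)
```

In this model every rank ends up with three token copies. The real kernel performs the same logical exchange through the send and receive buffers rather than Python lists, and the order in which copies arrive on a device is not guaranteed to match this loop order.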

Parameters:

dispatch_dtype (DType): DType used for the token payload during dispatch (inferred).
hidden_size (Int): Size of the model's hidden dimension (inferred).
top_k (Int): Number of experts each token is routed to (inferred).
n_experts (Int): Total number of experts across all GPUs (inferred).
max_token_per_rank (Int): Maximum number of tokens any GPU can receive (inferred).
n_gpus_per_node (Int): Number of GPUs per physical node (inferred).
n_nodes (Int): Number of physical nodes in the deployment (inferred).
fused_shared_expert (Bool): Whether a shared expert is fused into the dispatch kernel (inferred).
skip_a2a (Bool): Whether to skip the all-to-all communication and send tokens only within the current device (inferred).
allreduce_world_size (Int): Number of ranks participating in the allreduce following dispatch (inferred).
target (StringSlice[ImmStaticOrigin]): Compile-time device target.
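
The topology parameters above combine by simple arithmetic. The sketch below, in plain Python, shows one plausible way `n_experts`, `n_gpus_per_node`, and `n_nodes` determine where an expert lives. The contiguous, evenly divided placement is an assumption made for illustration, not something this page specifies, and `expert_placement` is a hypothetical helper.

```python
def expert_placement(expert_id, n_experts, n_gpus_per_node, n_nodes):
    """Map a global expert ID to a rank, node, and local slot.

    Assumption: experts are divided evenly and contiguously across all ranks.
    """
    world_size = n_gpus_per_node * n_nodes
    if n_experts % world_size != 0:
        raise ValueError("n_experts must be divisible by the number of ranks")
    experts_per_rank = n_experts // world_size
    rank = expert_id // experts_per_rank
    return {
        "rank": rank,
        "node": rank // n_gpus_per_node,
        "local_gpu": rank % n_gpus_per_node,
        "local_expert": expert_id % experts_per_rank,
    }


# 64 experts over 2 nodes of 8 GPUs each: 16 ranks, 4 experts per rank.
placement = expert_placement(
    expert_id=37, n_experts=64, n_gpus_per_node=8, n_nodes=2
)
```

Under these assumptions expert 37 sits on rank 9, which is GPU 1 of node 1, as that rank's second local expert.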

Args:

output_tokens (ManagedTensorSlice[IOSpec[_, _].Output, static_spec=output_tokens.static_spec]): Output tensor storing the received tokens in BF16 format. Shape [num_tokens, hidden_size].
row_offsets (ManagedTensorSlice[IOSpec[_, _].Output, static_spec=row_offsets.static_spec]): Output tensor storing the row offsets for the received tokens.
expert_ids (ManagedTensorSlice[IOSpec[_, _].Output, static_spec=expert_ids.static_spec]): Output tensor storing the expert ID for each received token.
src_info (ManagedTensorSlice[IOSpec[_, _].Output, static_spec=src_info.static_spec]): Output tensor recording the originating rank and token index for each received token. Shape [num_tokens, 2].
atomic_counters (ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec]): Atomic counters coordinating work across thread blocks during the dispatch phase.
input_tokens (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec]): Input tokens to dispatch to experts. Shape [num_tokens, hidden_size].
topk_ids (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=topk_ids.static_spec]): Input tensor of top-k expert IDs per token. Shape [num_tokens, top_k].
send_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec]): Send buffer pointers for the dispatch phase.
recv_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec]): Receive buffer pointers for the dispatch phase.
recv_count_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec]): Receive count buffer pointers tracking tokens received per expert.
context (DeviceContext): GPU device context for the current device.
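
To make the relationship between the four output tensors concrete, here is a minimal plain-Python sketch of how received token copies could be packed. Several details are assumptions that this page does not confirm: rows are grouped by expert, `row_offsets` is a prefix sum with one entry per local expert plus one, and each `src_info` row is ordered as `[rank, token index]`. `pack_received` is a hypothetical helper, not part of the API.

```python
def pack_received(received, first_local_expert, n_local_experts):
    """Pack (expert_id, src_rank, token_idx, token) tuples into output arrays.

    Assumptions: rows are grouped by expert, row_offsets has
    n_local_experts + 1 entries, and src_info rows are [src_rank, token_idx].
    """
    # A stable sort groups rows by expert and keeps arrival order within a group.
    ordered = sorted(received, key=lambda item: item[0])
    output_tokens = [token for _, _, _, token in ordered]
    expert_ids = [expert_id for expert_id, _, _, _ in ordered]
    src_info = [[src_rank, token_idx] for _, src_rank, token_idx, _ in ordered]

    # Count rows per local expert, then turn the counts into start offsets.
    counts = [0] * n_local_experts
    for expert_id in expert_ids:
        counts[expert_id - first_local_expert] += 1
    row_offsets = [0]
    for count in counts:
        row_offsets.append(row_offsets[-1] + count)
    return output_tokens, row_offsets, expert_ids, src_info


# Token copies received by a rank that owns experts 2 and 3.
received = [
    (2, 0, 0, [1.0, 1.0]),
    (3, 0, 1, [2.0, 2.0]),
    (2, 1, 0, [3.0, 3.0]),
]
output_tokens, row_offsets, expert_ids, src_info = pack_received(
    received, first_local_expert=2, n_local_experts=2
)
```

With this layout, rows `row_offsets[e]` up to but not including `row_offsets[e + 1]` belong to local expert `e`, which is the shape a grouped matrix multiplication would typically consume. The `src_info` rows keep enough information to return each result to its originating rank and token.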