Mojo struct

Struct_ep_dispatch_async

struct Struct_ep_dispatch_async

Registers the ep.dispatch_async graph op with the graph compiler.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[input_dtype: DType, dispatch_dtype: DType, dispatch_scale_dtype: DType, hidden_size: Int, top_k: Int, n_experts: Int, max_token_per_rank: Int, n_gpus_per_node: Int, n_nodes: Int, dispatch_fmt_str: StringSlice[ImmStaticOrigin], //, target: StringSlice[ImmStaticOrigin]](atomic_counters: ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec], input_tokens: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec], topk_ids: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=topk_ids.static_spec], send_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec], recv_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec], recv_count_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec], context: DeviceContext)

Executes the Expert Parallelism (EP) async dispatch kernel. Tokens are transferred in either Blockwise FP8 or BF16 format, as selected by the dispatch_fmt_str parameter.

Parameters:

input_dtype (DType): DType of the input tokens before dispatch (inferred).
dispatch_dtype (DType): DType used for the quantized token payload during dispatch (inferred).
dispatch_scale_dtype (DType): DType of the block scales accompanying the dispatched tokens (inferred).
hidden_size (Int): Size of the model's hidden dimension (inferred).
top_k (Int): Number of experts each token is routed to (inferred).
n_experts (Int): Total number of experts across all GPUs (inferred).
max_token_per_rank (Int): Maximum number of tokens per GPU (inferred).
n_gpus_per_node (Int): Number of GPUs per node (inferred).
n_nodes (Int): Number of physical nodes (inferred).
dispatch_fmt_str (StringSlice[ImmStaticOrigin]): String selecting the dispatch token format, either "BlockwiseFP8" or "BF16" (inferred).
target (StringSlice[ImmStaticOrigin]): Compile-time device target.
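
As a rough illustration of how the topology parameters above relate to one another, the following Python sketch derives the total rank count and the number of experts hosted on each rank. It assumes one rank per GPU and an even split of experts across ranks; neither assumption is stated on this page, and the helper name `ep_layout` is hypothetical rather than part of the Mojo API.

```python
def ep_layout(n_experts: int, n_gpus_per_node: int, n_nodes: int) -> tuple[int, int]:
    """Return (n_ranks, experts_per_rank) under an even-split assumption."""
    # Assumption: every GPU in the cluster is exactly one rank.
    n_ranks = n_gpus_per_node * n_nodes
    # Assumption: experts are distributed evenly, so the division must be exact.
    if n_experts % n_ranks != 0:
        raise ValueError("n_experts must divide evenly across ranks in this sketch")
    return n_ranks, n_experts // n_ranks


n_ranks, experts_per_rank = ep_layout(n_experts=64, n_gpus_per_node=8, n_nodes=2)
print(n_ranks, experts_per_rank)  # prints "16 4"
```

With 64 experts on 2 nodes of 8 GPUs each, this sketch gives 16 ranks hosting 4 experts apiece.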

Args:

atomic_counters (ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec]): Atomic counters coordinating work across thread blocks during the dispatch phase.
input_tokens (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec]): Input tokens to dispatch to experts. Shape [num_tokens, hidden_size].
topk_ids (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=topk_ids.static_spec]): Top-k expert IDs per token. Shape [num_tokens, top_k].
send_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec]): Send buffer pointers for the dispatch phase.
recv_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec]): Receive buffer pointers for the dispatch phase.
recv_count_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec]): Receive count buffer pointers tracking tokens received per expert.
context (DeviceContext): GPU device context for the current device.
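
To make the tensor shapes above concrete, the following Python sketch builds a small topk_ids table of shape [num_tokens, top_k] and tallies how many token copies are routed to each expert, which is the kind of per-expert count that recv_count_ptrs is described as tracking. It models only the routing arithmetic on the host. The kernel's actual buffer layout and synchronization are not described on this page, and `count_tokens_per_expert` is a hypothetical helper, not part of the Mojo API.

```python
def count_tokens_per_expert(topk_ids: list[list[int]], n_experts: int) -> list[int]:
    """Count how many token copies each expert would receive."""
    counts = [0] * n_experts
    for token_experts in topk_ids:      # one row per token
        for expert_id in token_experts:  # top_k expert IDs for that token
            if not 0 <= expert_id < n_experts:
                raise ValueError(f"expert id {expert_id} is out of range")
            counts[expert_id] += 1
    return counts


# Three tokens, each routed to top_k = 2 of n_experts = 4 experts.
topk_ids = [[0, 2], [1, 2], [2, 3]]
counts = count_tokens_per_expert(topk_ids, n_experts=4)
print(counts)  # prints "[1, 1, 3, 1]"
```

The counts always sum to num_tokens * top_k, since every token is sent to each of its top_k experts.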
