
Mojo struct

Struct_ep_combine_async

struct Struct_ep_combine_async

Registers the ep.combine_async graph op with the graph compiler.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[combine_dtype: DType, hidden_size: Int, top_k: Int, n_experts: Int, max_token_per_rank: Int, n_gpus_per_node: Int, n_nodes: Int, //, target: StringSlice[ImmStaticOrigin]](atomic_counters: ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec], input_tokens: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec], src_info: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=src_info.static_spec], send_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec], recv_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec], recv_count_ptrs: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec], context: DeviceContext)

Execute the Expert Parallelism combine kernel.

Sends expert-processed output tokens back to their original devices asynchronously, without waiting for the transfers to complete.

Parameters:

combine_dtype (DType): DType used for the token payload during the combine phase (inferred).
hidden_size (Int): Size of the model's hidden dimension (inferred).
top_k (Int): Number of experts each token is routed to (inferred).
n_experts (Int): Total number of experts across all GPUs (inferred).
max_token_per_rank (Int): Maximum number of tokens per GPU (inferred).
n_gpus_per_node (Int): Number of GPUs per node (inferred).
n_nodes (Int): Number of physical nodes (inferred).
target (StringSlice[ImmStaticOrigin]): Compile-time device target.
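
The topology parameters above imply a few derived quantities. The sketch below is plain Python, not Mojo, and only illustrates the arithmetic. The helper name `ep_topology` is hypothetical. It assumes one rank per GPU and that experts are split evenly across ranks; neither assumption is stated by this page.

```python
def ep_topology(n_experts: int, n_gpus_per_node: int, n_nodes: int) -> dict:
    """Hypothetical helper: derive rank counts from the compile-time parameters.

    Assumes one rank per GPU and an even split of experts across ranks.
    """
    n_ranks = n_gpus_per_node * n_nodes
    if n_ranks <= 0:
        raise ValueError("n_gpus_per_node and n_nodes must be positive")
    if n_experts % n_ranks != 0:
        raise ValueError("this sketch assumes n_experts is divisible by the rank count")
    return {"n_ranks": n_ranks, "experts_per_rank": n_experts // n_ranks}


# Example: 64 experts spread over 2 nodes with 8 GPUs each.
topo = ep_topology(n_experts=64, n_gpus_per_node=8, n_nodes=2)
print(topo)
```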

Args:

atomic_counters (ManagedTensorSlice[IOSpec[_, _].MutableInput, static_spec=atomic_counters.static_spec]): Atomic counters coordinating work across thread blocks during the combine phase.
input_tokens (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=input_tokens.static_spec]): Expert output tokens to send back to their original devices. Shape [num_tokens, hidden_size].
src_info (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=src_info.static_spec]): Source routing information from the dispatch phase recording the originating rank and token index for each token. Shape [num_tokens, 2].
send_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=send_ptrs.static_spec]): Send buffer pointers for the combine phase.
recv_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_ptrs.static_spec]): Receive buffer pointers for the combine phase.
recv_count_ptrs (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=recv_count_ptrs.static_spec]): Receive count buffer pointers tracking the number of tokens received per expert.
context (DeviceContext): GPU device context for the current device.
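
To make the roles of `input_tokens` and `src_info` concrete, here is a host-side reference model in plain Python of where each row ends up. It is a sketch of the data movement only, not the GPU kernel: it ignores the send/receive buffers, the atomic counters, and asynchrony. The function name `simulate_combine` and the column order of `src_info` (originating rank first, then token index) are assumptions made for illustration.

```python
def simulate_combine(input_tokens, src_info):
    """Hypothetical reference model: group expert outputs by their origin.

    input_tokens: list of rows, each of length hidden_size.
    src_info: list of (src_rank, token_idx) pairs, one per row
              (column order is an assumption of this sketch).
    Returns a dict mapping (src_rank, token_idx) to the list of rows
    that would be sent back for that token.
    """
    if len(input_tokens) != len(src_info):
        raise ValueError("input_tokens and src_info must have the same number of rows")
    returned = {}
    for row, (src_rank, token_idx) in zip(input_tokens, src_info):
        # Each expert output goes back to the rank and slot it came from.
        returned.setdefault((src_rank, token_idx), []).append(list(row))
    return returned


# Three expert outputs with hidden_size = 2. Two of them belong to the same
# original token (rank 0, index 0), as happens when top_k > 1.
input_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
src_info = [(0, 0), (1, 0), (0, 0)]
returned = simulate_combine(input_tokens, src_info)
```

In this model, the token at rank 0, index 0 receives two rows back, one per expert that processed it, and the token at rank 1, index 0 receives one.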
