Mojo struct

Struct_grouped_matmul_block_scaled_mxfp4

struct Struct_grouped_matmul_block_scaled_mxfp4[preshuffled_b: Bool = False]

MOGG wrapper for grouped block-scaled matrix multiplication.

Provides graph compiler integration for block-scaled grouped matmul operations used in Mixture of Experts (MoE) layers on AMD GPUs.

Parameters

preshuffled_b (Bool): When True, dispatches to mxfp4_grouped_matmul_amd_preb which expects B in the 5D preshuffled layout from Shuffler.preshuffle_b_5d (typically produced by the model's weight adapter at load time, e.g. Kimi K2.5). When False (default), dispatches to the dense mxfp4_grouped_matmul_amd kernel that reads B row-major. The caller is responsible for preparing B in the matching layout.

Implemented traits

AnyType, ImplicitlyDeletable

Methods

`execute`

static def execute[c_type: DType, //, target: StringSlice[ImmStaticOrigin]](c: ManagedTensorSlice[IOSpec[_, _].Output, static_spec=c.static_spec], a: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=a.static_spec], b: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=b.static_spec], a_scales: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=a_scales.static_spec], b_scales: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=b_scales.static_spec], expert_start_indices: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=expert_start_indices.static_spec], expert_ids: ManagedTensorSlice[IOSpec[_, _].Input, static_spec=expert_ids.static_spec], max_num_tokens_per_expert: UInt32, num_active_experts: UInt32, estimated_total_m: UInt32, decode_grid_m_cap: UInt32, context: DeviceContext)

Executes grouped block-scaled matrix multiplication.

Computes C = A @ B^T independently for each expert group, where A and B are block-scaled (e.g. MXFP4: two 4-bit floating-point values packed into each uint8, which is why the packed inner dimension of A and B is K // 2).
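
To make the packed format described above concrete, here is a minimal Python sketch of how one MXFP4 block could be decoded on the host for reference checking. It follows the OCP Microscaling definition of MXFP4 (E2M1 elements with a shared power-of-two E8M0 scale); it is not the kernel's code. The function names are hypothetical, and the nibble order (low nibble first) is an assumption that this page does not confirm.

```python
# Reference decode for MXFP4 data (illustrative only, not the kernel's code).
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_byte(byte):
    """Split one uint8 into two FP4 values.

    ASSUMPTION: the low nibble holds the first element. The real layout
    used by the kernel may differ.
    """
    return decode_fp4(byte & 0xF), decode_fp4(byte >> 4)


def decode_scale(e8m0):
    """Decode an E8M0 shared scale: 2 ** (e - 127). The NaN code 255 is not handled."""
    return 2.0 ** (e8m0 - 127)


def dequantize_block(packed, e8m0):
    """Dequantize packed bytes that share one E8M0 scale into a list of floats."""
    scale = decode_scale(e8m0)
    values = []
    for byte in packed:
        lo, hi = unpack_byte(byte)
        values.extend((lo * scale, hi * scale))
    return values
```

For example, `dequantize_block([0x21, 0xF9], 128)` decodes four elements with a shared scale of 2. In the MX format a scale is shared by a block of 32 elements, so a full block would occupy 16 packed bytes.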

Parameters:

c_type (DType): The output tensor data type.
target (StringSlice[ImmStaticOrigin]): The target GPU device.

Args:

c (ManagedTensorSlice[IOSpec[_, _].Output, static_spec=c.static_spec]): The output tensor of shape (total_tokens, N).
a (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=a.static_spec]): The input tensor of shape (total_tokens, K // 2).
b (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=b.static_spec]): The weight tensor of shape (num_experts, N, K // 2).
a_scales (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=a_scales.static_spec]): The A scale factors in 2D layout.
b_scales (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=b_scales.static_spec]): The B scale factors in 3D layout.
expert_start_indices (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=expert_start_indices.static_spec]): The starting token index for each expert.
expert_ids (ManagedTensorSlice[IOSpec[_, _].Input, static_spec=expert_ids.static_spec]): The expert ID for each group.
max_num_tokens_per_expert (UInt32): The maximum token count for any expert.
num_active_experts (UInt32): The number of active experts.
estimated_total_m (UInt32): Estimated total number of tokens received by this GPU. The preshuffled-B (preb) dispatcher uses it to choose between the persistent and direct kernel paths. Pass 0 to default to the persistent path. Ignored when preshuffled_b == False.
decode_grid_m_cap (UInt32): Per-expert decode cap used to size grid.y of the direct kernel on the decode bands. Pass 0 to disable the cap (full-stride fallback). Ignored when preshuffled_b == False.
context (DeviceContext): The device context pointer.
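
The grouping arguments above can be illustrated with a small pure-Python reference that operates on already-dequantized values. This is a sketch of the intended semantics, not the kernel. It assumes that `expert_start_indices` holds `num_active_experts + 1` offsets, so that group `i` covers rows `[expert_start_indices[i], expert_start_indices[i + 1])`, and that `expert_ids[i]` selects the slice of `b` used for that group. Neither convention is stated on this page, and the function name is hypothetical.

```python
def grouped_matmul_reference(a, b, expert_start_indices, expert_ids,
                             num_active_experts):
    """Compute C = A @ B^T per expert group on dequantized Python lists.

    a: total_tokens rows of K values.
    b: num_experts matrices, each N rows of K values.
    Returns total_tokens rows of N values.
    """
    n = len(b[0])
    c = [[0.0] * n for _ in a]
    for group in range(num_active_experts):
        start = expert_start_indices[group]
        # ASSUMPTION: one extra trailing offset marks the end of the last group.
        end = expert_start_indices[group + 1]
        weights = b[expert_ids[group]]
        for row in range(start, end):
            for col in range(n):
                # Row of A dotted with row of B is one entry of A @ B^T.
                c[row][col] = sum(x * w for x, w in zip(a[row], weights[col]))
    return c
```

With three tokens, where the first two are routed to expert 1 and the last to expert 0, the call would use `expert_start_indices = [0, 2, 3]` and `expert_ids = [1, 0]`. Rows not covered by any active group keep their initial value of zero in this sketch.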