Mojo struct

Struct_mxfp4_preshuffle_scale_4d_per_expert

struct Struct_mxfp4_preshuffle_scale_4d_per_expert

Per-step A-scale preshuffle for the AMD CDNA4 preb grouped matmul.

Takes row-major E8M0 A-scales of shape [total_tokens, K_SCALES] and writes cell-packed scales into per-expert, fixed-stride slots of size max_padded_M, where max_padded_M = align_up(max_num_tokens_per_expert, 32). The mxfp4_grouped_matmul_amd_preb kernel reads the slot at e * max_padded_M for expert slot e. This kernel leaves inactive slots and pad rows untouched; the matmul's tight per-expert V# bound guards against out-of-range reads.
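
The slot arithmetic above can be sketched in plain Python. This is only an illustration of the index math, not the kernel itself: the helper names `align_up` and `slot_layout` are hypothetical, the slot offsets are assumed to be counted in rows, and `expert_start_indices` is assumed to hold prefix-sum token offsets (one more entry than the number of experts). The cell-packing of the scales inside each slot is not modeled.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def slot_layout(expert_start_indices: list[int], max_num_tokens_per_expert: int):
    """Return (slot_start_row, valid_rows, pad_rows) for each expert slot.

    Assumption: expert_start_indices[e] is the first token row of expert e
    and expert_start_indices[e + 1] is one past its last token row.
    """
    # Every slot has the same fixed stride, regardless of how many
    # tokens the expert actually received.
    max_padded_m = align_up(max_num_tokens_per_expert, 32)
    layout = []
    for e in range(len(expert_start_indices) - 1):
        valid_rows = expert_start_indices[e + 1] - expert_start_indices[e]
        slot_start = e * max_padded_m  # slot e begins at e * max_padded_M
        pad_rows = max_padded_m - valid_rows  # rows the preshuffle leaves untouched
        layout.append((slot_start, valid_rows, pad_rows))
    return layout


if __name__ == "__main__":
    # Three experts with 40, 5, and 25 tokens; the largest count is 40.
    for slot in slot_layout([0, 40, 45, 70], max_num_tokens_per_expert=40):
        print(slot)
```

With a maximum of 40 tokens per expert, the stride rounds up to 64, so under these assumptions the three slots start at rows 0, 64, and 128, and each slot's remaining rows are padding.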

Implemented traits

AnyType, ImplicitlyDestructible

Methods

execute

static def execute[target: StringSlice[StaticConstantOrigin]](output: ManagedTensorSlice[Output, static_spec=output.static_spec], input: ManagedTensorSlice[Input, static_spec=input.static_spec], expert_start_indices: ManagedTensorSlice[Input, static_spec=expert_start_indices.static_spec], max_num_tokens_per_expert: UInt32, num_active_experts: UInt32, context: DeviceContext)