
Mojo function

mi355x_cost_model

def mi355x_cost_model() -> TargetCostModel

MI355X cost model: production-tuned latencies.

- Global loads (LOAD_A, LOAD_B): GLOBAL_MEM, 200 cycles, GLOBAL_LOAD
- Fragment loads (MMA_LOAD_A, MMA_LOAD_B): LDS, 20 cycles, FRAGMENT_LOAD
- MMA compute (COMPUTE, MMA): MMA_UNIT, 16 cycles, COMPUTE
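To make the latency entries above concrete, here is a minimal Python sketch that transcribes them into a lookup table. It is not the Mojo `TargetCostModel` API: the `OpCost` type, its field names (`resource`, `latency`, `role`), and the `chain_latency` helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class OpCost(NamedTuple):
    # Field names are assumptions; the source lists three values per op
    # without naming them.
    resource: str  # first value in each entry, e.g. GLOBAL_MEM
    latency: int   # cycles, as listed in the docstring
    role: str      # third value in each entry, e.g. GLOBAL_LOAD


# Entries transcribed from the documented MI355X latencies.
MI355X_COSTS = {
    "LOAD_A": OpCost("GLOBAL_MEM", 200, "GLOBAL_LOAD"),
    "LOAD_B": OpCost("GLOBAL_MEM", 200, "GLOBAL_LOAD"),
    "MMA_LOAD_A": OpCost("LDS", 20, "FRAGMENT_LOAD"),
    "MMA_LOAD_B": OpCost("LDS", 20, "FRAGMENT_LOAD"),
    "COMPUTE": OpCost("MMA_UNIT", 16, "COMPUTE"),
    "MMA": OpCost("MMA_UNIT", 16, "COMPUTE"),
}


def chain_latency(ops):
    """Sum the latencies of ops run strictly back to back (no overlap)."""
    return sum(MI355X_COSTS[op].latency for op in ops)


# A fully serialized global load -> fragment load -> MMA chain.
serial = chain_latency(["LOAD_A", "MMA_LOAD_A", "MMA"])
print(serial)  # 236
```

The 236-cycle figure is only the unoverlapped sum of 200 + 20 + 16. A real schedule would normally overlap these stages, so treat it as an upper bound for a single dependent chain, not as a prediction of kernel performance.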

Op tags are kernel-specific (defined in PingPongOps / DefaultMatmulOps):

- Ping-pong: 0=LOAD_A, 1=LOAD_B, 2=COMPUTE, 3=MMA_LOAD_A, 4=MMA_LOAD_B, 5=MMA
- Default: 0=LOAD_DRAM, 1=STORE_SMEM, 2=LOAD_FRAG, 3=COMPUTE
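Because the tags are kernel-specific, the same integer names a different op depending on which kernel's tag set is in use. The Python sketch below mirrors the two tag tables as enums to show this. The class names `PingPongOp` and `DefaultMatmulOp` and the `describe` helper are hypothetical stand-ins, not the actual PingPongOps / DefaultMatmulOps definitions.

```python
from enum import IntEnum


class PingPongOp(IntEnum):
    # Tag values transcribed from the ping-pong table.
    LOAD_A = 0
    LOAD_B = 1
    COMPUTE = 2
    MMA_LOAD_A = 3
    MMA_LOAD_B = 4
    MMA = 5


class DefaultMatmulOp(IntEnum):
    # Tag values transcribed from the default table.
    LOAD_DRAM = 0
    STORE_SMEM = 1
    LOAD_FRAG = 2
    COMPUTE = 3


def describe(tag, kernel_ops):
    """Resolve an integer tag to its op name within one kernel's tag set."""
    return kernel_ops(tag).name


print(describe(2, PingPongOp))       # COMPUTE
print(describe(2, DefaultMatmulOp))  # LOAD_FRAG
```

Tag 2 resolves to COMPUTE for the ping-pong kernel but to LOAD_FRAG for the default kernel, so a raw tag is only meaningful together with the kernel it came from.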

VGPR liveness hints (per-quadrant frag, fp8 MFMA 16x16x128, num_k_mmas=2):

- LOAD_A, LOAD_B: 0 (buffer_load_lds writes directly to LDS, so there is no persistent register footprint)
- MMA_LOAD_A, MMA_LOAD_B: vgpr_def=8. One quadrant frag is 8 VGPRs (32 fp8 per lane, packed 4 per dword).
- MMA: vgpr_def=4, vgpr_kill=0. One MMA writes 4 VGPRs of f32 accumulator (16x16 = 256 elements / 64 lanes = 4 elements per lane, one per VGPR). The accumulator persists across iterations, so no VGPRs are killed.
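The register counts above follow from simple arithmetic, which this short Python sketch reproduces. It assumes that a VGPR holds one 32-bit dword per lane and that the accumulator tile is the 16x16 output of the 16x16x128 MFMA. All variable names are illustrative.

```python
LANES = 64          # lanes sharing the MMA result, from the text
FP8_PER_LANE = 32   # fp8 values each lane holds for one quadrant frag
FP8_PER_DWORD = 4   # four 8-bit values pack into one 32-bit VGPR

# Fragment load: 32 fp8 per lane, 4 per dword -> 8 VGPRs defined.
frag_vgprs = FP8_PER_LANE // FP8_PER_DWORD

# Accumulator: assumed 16x16 f32 output tile spread across 64 lanes,
# one f32 element per VGPR -> 4 VGPRs defined per MMA.
MMA_M, MMA_N = 16, 16
acc_vgprs = (MMA_M * MMA_N) // LANES

# Hints as (vgpr_def, vgpr_kill). Only the MMA kill count (0) is stated in
# the text; the kill count for the other ops is left unspecified as None.
hints = {
    "LOAD": (0, None),
    "MMA_LOAD": (frag_vgprs, None),
    "MMA": (acc_vgprs, 0),
}

print(frag_vgprs, acc_vgprs)  # 8 4
```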

Returns:

TargetCostModel