Mojo function

amd_4wave_scheduled_matmul

amd_4wave_scheduled_matmul[a_type: DType, b_type: DType, c_type: DType, //, enable_swizzle: Bool = True, block_m_override: Int = 0, block_n_override: Int = 0, dump_asm_path: StringSlice[StaticConstantOrigin] = StringSlice("")](a: TileTensor[a_type, address_space=a.address_space, linear_idx_type=a.linear_idx_type, element_size=a.element_size], b: TileTensor[b_type, address_space=b.address_space, linear_idx_type=b.linear_idx_type, element_size=b.element_size], c: TileTensor[c_type, address_space=c.address_space, linear_idx_type=c.linear_idx_type, element_size=c.element_size], ctx: DeviceContext)

Launches the schedule-compiler-driven 4-wave matmul on the device.

Dispatch is identical to amd_4wave_matmul (same auto-pick heuristic, same override gates, same chiplet/L2 swizzle, same 1D launch grid); the only difference is that this function invokes AMD4WaveMatmul.run with the comptime flag use_framework_schedule=True. Use it as the framework arm of an A/B comparison against the inline arm, so that a performance gap can be attributed to either op ordering or scaffolding.

Parameters:

  • a_type (DType): Element type of a.
  • b_type (DType): Element type of b.
  • c_type (DType): Element type of c.
  • enable_swizzle (Bool): Enable LDS bank-conflict avoidance.
  • block_m_override (Int): If > 0, force BM to this value (must be 64 or 128).
  • block_n_override (Int): If > 0, force BN to this value (must be 64, 128, or 256). Default 0 uses BM=BN.
  • dump_asm_path (StringSlice[StaticConstantOrigin]): If non-empty, dumps the compiled GCN assembly to the given file path. Only used for ASM-level diff-debugging.
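
The override rules for block_m_override and block_n_override can be modeled with a short Python sketch. This is an illustration of the documented rules only, not the kernel's implementation: the name resolve_block_shape, the auto_bm argument (a stand-in for the auto-pick heuristic, which this page does not describe), and the use of ValueError for rejected values are all assumptions. The sketch also reads "Default 0 uses BM=BN" as "BN takes the value of BM".

```python
from typing import Tuple

# Values the parameter descriptions list as legal overrides.
ALLOWED_BM = (64, 128)
ALLOWED_BN = (64, 128, 256)


def resolve_block_shape(
    block_m_override: int = 0,
    block_n_override: int = 0,
    auto_bm: int = 128,  # hypothetical stand-in for the auto-pick heuristic
) -> Tuple[int, int]:
    """Return (BM, BN) under the documented override rules (illustrative only)."""
    if block_m_override > 0:
        if block_m_override not in ALLOWED_BM:
            raise ValueError("block_m_override must be 64 or 128")
        bm = block_m_override
    else:
        # No override: fall back to whatever the heuristic would pick.
        bm = auto_bm

    if block_n_override > 0:
        if block_n_override not in ALLOWED_BN:
            raise ValueError("block_n_override must be 64, 128, or 256")
        bn = block_n_override
    else:
        # Assumed reading of "Default 0 uses BM=BN": BN follows BM.
        bn = bm

    return bm, bn
```

For example, resolve_block_shape(64, 256) yields a 64 by 256 tile, while resolve_block_shape(64) leaves BN unset and so yields 64 by 64. In the real function these are comptime parameters, so an invalid value would presumably be rejected at compile time rather than at run time.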

Args:

Raises:

An error if device enqueue fails.