
Mojo function

quantize_mxfp4_amd

def quantize_mxfp4_amd[out_dtype: DType = DType.uint8, scales_dtype: DType = DType.float8_e8m0fnu, in_dtype: DType = DType.bfloat16, //, *, num_max_threads: Int = 512](ctx: DeviceContext, output_tile: TileTensor[out_dtype, linear_idx_type=output_tile.linear_idx_type, element_size=output_tile.element_size], scales_tile: TileTensor[scales_dtype, linear_idx_type=scales_tile.linear_idx_type, element_size=scales_tile.element_size], input_tile: TileTensor[in_dtype, linear_idx_type=input_tile.linear_idx_type, element_size=input_tile.element_size])

Quantize BF16 activations to MXFP4 on AMD CDNA4 (MI355X).

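Produces packed uint8 output (two 4-bit E2M1 values per byte) and 2D E8M0 block scales (one shared power-of-two scale per block of elements), compatible with dequant_mxfp4() and the V_MFMA_SCALE_F32_16X16X128_F8F6F4 instruction.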
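The quantization scheme can be modeled on the host, for example to check kernel output in tests. The following is a minimal pure-Python sketch, not the Mojo kernel itself. It assumes the OCP Microscaling (MX) v1.0 conventions: 32-element blocks, a shared scale exponent of floor(log2(amax)) - 2 stored as a biased E8M0 code, and E2M1 elements rounded to nearest-even with saturation at ±6. The nibble order (even-indexed element in the low nibble), the rounding mode, and the scale chosen for all-zero blocks are assumptions. `quantize_mxfp4_ref`, `dequantize_mxfp4_ref`, and `encode_e2m1` are hypothetical names and are not part of the MAX API.

```python
import math

# FP4 E2M1 magnitudes indexed by the 3-bit code (2 exponent bits + 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # assumed: OCP MX block size, one shared scale per 32 elements
E2M1_EMAX = 2    # largest E2M1 exponent (6.0 == 1.5 * 2**2)
E8M0_BIAS = 127  # E8M0 stores only a biased exponent


def encode_e2m1(x: float) -> int:
    """Round x to the nearest E2M1 value (saturating at 6.0); return the 4-bit code."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)
    # Nearest magnitude; on a tie prefer the even code (mantissa bit 0), i.e. RNE.
    code = min(range(8), key=lambda i: (abs(E2M1_VALUES[i] - mag), i & 1))
    return sign | code


def quantize_mxfp4_ref(row):
    """Quantize floats (len a multiple of 32) to packed bytes plus E8M0 scale codes."""
    assert len(row) % BLOCK_SIZE == 0
    packed, scales = [], []
    for b in range(0, len(row), BLOCK_SIZE):
        block = row[b:b + BLOCK_SIZE]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            shared_exp = 0  # assumed: any scale works for an all-zero block
        else:
            _, e = math.frexp(amax)  # amax = m * 2**e with 0.5 <= m < 1
            shared_exp = (e - 1) - E2M1_EMAX  # floor(log2(amax)) - emax
        shared_exp = max(-E8M0_BIAS, min(E8M0_BIAS, shared_exp))
        scales.append(shared_exp + E8M0_BIAS)
        scale = 2.0 ** shared_exp
        codes = [encode_e2m1(v / scale) for v in block]
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))  # assumed: even element in low nibble
    return packed, scales


def dequantize_mxfp4_ref(packed, scales):
    """Inverse of quantize_mxfp4_ref: unpack nibbles and apply each block's scale."""
    out = []
    for i, byte in enumerate(packed):
        scale = 2.0 ** (scales[(2 * i) // BLOCK_SIZE] - E8M0_BIAS)
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[code & 0x7] * scale
            out.append(-mag if code & 0x8 else mag)
    return out
```

For example, `quantize_mxfp4_ref([1.0, -2.0, 3.0, 6.0] * 8)` yields 16 packed bytes and a single scale code of 127 (a scale of 1.0). Dequantizing round-trips exactly because every input is representable in E2M1.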

NOTE: The 2D scales layout is a stand-in. The optimized CDNA4 layout will likely be 6D (32x32 tiles) or 7D (16x16 tiles), mirroring how SM100 uses a 5D interleaved layout for its tensor core scale feed.
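With a 2D scales tensor, locating the scale for an element takes a single integer division. The sketch below assumes an M×K row-major input quantized along K, so the packed output has shape [M, K // 2] and the scales have shape [M, K // 32]. These shapes, the nibble order, and the helper names `scale_index_2d` and `dequant_element` are assumptions for illustration, not documented behavior. A 6D or 7D tiled layout would presumably replace `scale_index_2d` with a mapping that reorders the scales into per-tile order.

```python
# Hypothetical host-side model of the 2D scales layout (not the MAX API).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # assumed: one E8M0 scale per 32 consecutive elements along K
E8M0_BIAS = 127


def scale_index_2d(m: int, k: int) -> tuple[int, int]:
    """(row, col) of the scale for element (m, k) in a row-major [M, K // 32] tensor."""
    return m, k // BLOCK_SIZE


def dequant_element(packed, scales, m: int, k: int) -> float:
    """Decode element (m, k) from packed [M, K // 2] bytes and [M, K // 32] scale codes."""
    byte = packed[m][k // 2]
    code = byte & 0xF if k % 2 == 0 else byte >> 4  # assumed: even k in low nibble
    row, col = scale_index_2d(m, k)
    scale = 2.0 ** (scales[row][col] - E8M0_BIAS)
    mag = E2M1_VALUES[code & 0x7] * scale
    return -mag if code & 0x8 else mag
```

Keeping the layout logic in one indexing function means a later switch to a tiled scale layout should only need to change that function, not the decoding arithmetic.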

Args: