Mojo function

fp8_quantize

fp8_quantize[out_dtype: DType, *, use_clamp: Bool = is_amd_gpu()](values: SIMD[dtype, size], scale_recip: Scalar[dtype]) -> SIMD[out_dtype, size]

Quantize values to FP8, optionally clamping to the representable range.

On AMD GPUs, clamping is faster because it avoids extra NaN handling during the cast.

Parameters:

  • out_dtype (DType): The FP8 output dtype.
  • use_clamp (Bool): Whether to clamp values to [min_finite, max_finite] before the cast. Defaults to True on AMD GPUs, False otherwise.

Args:

  • values (SIMD): Values to quantize (already normalized as needed, but not yet scaled).
  • scale_recip (Scalar): Reciprocal of the FP8 scale factor.

Returns:

SIMD: FP8-quantized values.
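
The behavior described above can be sketched in plain Python for illustration: each value is multiplied by the reciprocal scale, optionally clamped to the FP8 format's finite range, and then cast to FP8 (the final bit-level cast is elided here). This is a hypothetical sketch, not the Mojo implementation; the 448.0 bound assumes the OCP float8_e4m3fn format, and the function name mirrors the API only for readability.

```python
# Hypothetical sketch of fp8_quantize semantics (not the actual Mojo code).
# Assumes the float8_e4m3fn format, whose max finite value is 448.0.
E4M3_MAX_FINITE = 448.0

def fp8_quantize(values, scale_recip, use_clamp=True):
    """Scale each value by the reciprocal of the FP8 scale factor and,
    when use_clamp is set, clamp to the representable finite range.
    The final cast to FP8 bits is omitted in this sketch."""
    out = []
    for v in values:
        x = v * scale_recip
        if use_clamp:
            # Clamp to [min_finite, max_finite] so out-of-range values
            # (and, on real hardware, NaN-producing overflow) are avoided.
            x = max(-E4M3_MAX_FINITE, min(E4M3_MAX_FINITE, x))
        out.append(x)
    return out

print(fp8_quantize([100.0, 10000.0, -10000.0], 0.5))
# values whose scaled magnitude exceeds the range are clamped to +/-448
```

Note that `use_clamp` defaults to clamping here for simplicity; in the Mojo API the default is `is_amd_gpu()`, i.e. clamping is enabled only on AMD GPUs.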