Mojo function
fp8_quantize
fp8_quantize[out_dtype: DType, *, use_clamp: Bool = is_amd_gpu()](values: SIMD[dtype, size], scale_recip: Scalar[dtype]) -> SIMD[out_dtype, size]
Quantize values to FP8, optionally clamping to the representable range.
On AMD GPUs, clamping is faster because of how NaN values are handled.
Parameters:
- out_dtype (DType): The FP8 output dtype.
- use_clamp (Bool): Whether to clamp to [min_finite, max_finite] before the cast. Defaults to True on AMD GPUs, False otherwise.
Args:
- values (SIMD): Values to quantize (already normalized as needed, not yet scaled).
- scale_recip (Scalar): Reciprocal of the FP8 scale factor.
Returns:
SIMD: FP8-quantized values.
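For illustration, here is a minimal usage sketch. The import path for `fp8_quantize` is not shown on this page and is assumed to already be in scope; the input values, the scale, and the choice of `DType.float8_e4m3fn` as the output dtype are hypothetical.

```mojo
fn main():
    # Hypothetical inputs: four float32 values and a reciprocal scale of 0.5.
    var vals = SIMD[DType.float32, 4](0.5, -1.25, 3.0, 448.0)
    var scale_recip = Scalar[DType.float32](0.5)

    # Quantize to an FP8 dtype (here float8_e4m3fn), clamping to the
    # representable range before the cast.
    var q = fp8_quantize[DType.float8_e4m3fn, use_clamp=True](vals, scale_recip)
    print(q)
```

With `use_clamp=True`, out-of-range inputs saturate to the FP8 [min_finite, max_finite] bounds before the cast rather than overflowing.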