Mojo function
fp8_quantize
fp8_quantize[out_dtype: DType, *, use_clamp: Bool = is_amd_gpu()](values: SIMD[dtype, size], scale_recip: Scalar[dtype]) -> SIMD[out_dtype, size]
Quantize values to FP8, optionally clamping to the representable range.
On AMD GPUs, clamping is faster because of how NaN values are handled.
Parameters:
- out_dtype (DType): The FP8 output dtype.
- use_clamp (Bool): Whether to clamp to [min_finite, max_finite] before the cast. Defaults to True on AMD GPUs, False otherwise.
Args:
- values (SIMD): Values to quantize (already normalized as needed, not yet scaled).
- scale_recip (Scalar): Reciprocal of the FP8 scale factor.
Returns:
SIMD: FP8-quantized values.
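For illustration, here is a minimal usage sketch. The import path for `fp8_quantize` is not shown on this page and is assumed to already be in scope; the input values, the scale, and the choice of `DType.float8_e4m3fn` as the output dtype are hypothetical.

```mojo
fn main():
    # Hypothetical inputs: four float32 values and a reciprocal scale of 0.5.
    var vals = SIMD[DType.float32, 4](0.5, -1.25, 3.0, 448.0)
    var scale_recip = Scalar[DType.float32](0.5)

    # Quantize to an FP8 dtype (here float8_e4m3fn), clamping to the
    # representable range before the cast.
    var q = fp8_quantize[DType.float8_e4m3fn, use_clamp=True](vals, scale_recip)
    print(q)
```

With `use_clamp=True`, out-of-range inputs saturate to the FP8 [min_finite, max_finite] bounds before the cast rather than overflowing.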